[CI] TransformIT testStopWaitForCheckpoint fails #63365

Closed
imotov opened this issue Oct 6, 2020 · 5 comments · Fixed by #63657 or #64865
Labels: :ml/Transform Transform >test-failure Triaged test failures from CI

Comments

@imotov
Contributor

imotov commented Oct 6, 2020

Build scan: https://gradle-enterprise.elastic.co/s/nd6tm5enco4em

Repro line:

REPRODUCE WITH: ./gradlew ':x-pack:plugin:transform:qa:multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.transform.integration.TransformIT.testStopWaitForCheckpoint" \
  -Dtests.seed=B70E548D9A2EA63B \
  -Dtests.security.manager=true \
  -Dtests.locale=ja-JP \
  -Dtests.timezone=America/Indiana/Vincennes \
  -Druntime.java=11

Reproduces locally?: No

Applicable branches: master

Failure history:
Failed 7 times in the last week, several times today, recent failures:

Failure excerpt:

org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=status_exception, reason=Cannot start transform [transform-wait-for-checkpoint] as it is already started.]
	at __randomizedtesting.SeedInfo.seed([B70E548D9A2EA63B:CADB78E656BA4C0D]:0)
	at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187)
	at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1907)
	at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1884)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1641)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1613)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1580)
	at org.elasticsearch.client.TransformClient.startTransform(TransformClient.java:277)
	at org.elasticsearch.xpack.transform.integration.TransformIntegTestCase.startTransformWithRetryOnConflict(TransformIntegTestCase.java:150)
	at org.elasticsearch.xpack.transform.integration.TransformIT.testStopWaitForCheckpoint(TransformIT.java:269)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
	at java.base/java.lang.Thread.run(Thread.java:834)
@imotov imotov added >test-failure Triaged test failures from CI :ml/Transform Transform labels Oct 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@hendrikmuhs

hendrikmuhs commented Oct 7, 2020

This looks like #62204. The test uses a workaround, but for some reason the workaround does not get executed. Unfortunately, the traces are missing important information.

I will add extra logging.

hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Oct 7, 2020
hendrikmuhs pushed a commit that referenced this issue Oct 7, 2020
@hendrikmuhs

One more failure after adding logging:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/279/console

The logging proves that it retried 10 times. However, we wait only 5 ms between calls, which suggests that this is not enough on a slow VM.

hendrikmuhs pushed a commit that referenced this issue Oct 15, 2020
increase the overall timeout by increasing the wait time after every retry.

fixes #63365
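
The wait arithmetic behind this change can be sketched as follows. Ten fixed 5 ms pauses add up to only 50 ms, whereas ten pauses that grow with the attempt number (5 ms, 10 ms, ... 50 ms) add up to 275 ms. This is a minimal illustration, not the code in TransformIntegTestCase: the class `RetryWithBackoff`, its methods, and the `ConflictException` stand-in for the "already started" status exception are hypothetical names, and linear growth is only an assumption about what "increasing the wait time after every retry" could look like.

```java
import java.util.function.Supplier;

final class RetryWithBackoff {

    /** Hypothetical stand-in for the "already started" conflict reported by the server. */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** Total pause when each of {@code pauses} waits lasts {@code fixedMillis}. */
    static long fixedTotalMillis(int pauses, long fixedMillis) {
        return pauses * fixedMillis;
    }

    /** Total pause when the i-th wait (1-based) lasts i * baseMillis (assumed linear growth). */
    static long linearTotalMillis(int pauses, long baseMillis) {
        long total = 0;
        for (int i = 1; i <= pauses; i++) {
            total += i * baseMillis;
        }
        return total;
    }

    /** Retries {@code action} on a conflict, pausing attempt * baseMillis before the next try. */
    static <T> T callWithRetry(Supplier<T> action, int maxAttempts, long baseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (ConflictException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(attempt * baseMillis);
                }
            }
        }
        throw last; // every attempt hit the conflict
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

With these assumptions, `RetryWithBackoff.linearTotalMillis(10, 5)` gives the 275 ms figure and `RetryWithBackoff.fixedTotalMillis(10, 5)` the 50 ms one. Whether either budget is enough on a slow CI worker still has to be established empirically.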
@ywangd
Member

ywangd commented Oct 29, 2020

I am reopening this issue since it just failed again: https://gradle-enterprise.elastic.co/s/55ioilnsyjv2u

It has failed occasionally since the fix (2020-10-16): Build stats

@ywangd ywangd reopened this Oct 29, 2020
@hendrikmuhs

The errors seem to originate from Darwin instances, which are known to be slow (#58286). It seems the increased timeout in #63657 is not enough.

I will further increase the timeout and improve the assertion to make it clearer whether we ran into a timeout or some other error.
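
One way to separate the two failure modes is sketched below: exhausting the retry budget produces an assertion failure that names the timeout and reports the attempts and total wait, while any other exception propagates unchanged on the first occurrence. This is a hedged illustration rather than the change made in #64865; `StartAssertion`, `startOrFail`, and `ConflictException` are hypothetical names, and the message wording is an assumption.

```java
final class StartAssertion {

    /** Hypothetical stand-in for the "already started" conflict reported by the server. */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /**
     * Runs {@code start}, retrying only on a conflict. Returns the attempt that succeeded.
     * Running out of attempts is reported as a timeout; other exceptions are not caught.
     */
    static int startOrFail(Runnable start, int maxAttempts, long baseMillis) {
        long waitedMillis = 0;
        for (int attempt = 1; ; attempt++) {
            try {
                start.run();
                return attempt;
            } catch (ConflictException e) {
                if (attempt >= maxAttempts) {
                    throw new AssertionError(
                        "timed out starting transform after " + attempt
                            + " attempts and " + waitedMillis + " ms of waiting", e);
                }
            }
            long pause = attempt * baseMillis; // assumed linear growth of the wait
            waitedMillis += pause;
            try {
                Thread.sleep(pause);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", e);
            }
        }
    }
}
```

In a failing build, the message then shows whether the retry budget ran out or the start call failed for an unrelated reason, which the original stack trace alone could not distinguish.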

@hendrikmuhs hendrikmuhs self-assigned this Nov 10, 2020
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Nov 10, 2020
hendrikmuhs pushed a commit that referenced this issue Nov 11, 2020
…StopWaitForCheckpoint (#64865)

improve timeout when starting a transform with retry, improve transform
ids in tests to avoid clashes in logs

fixes #63365