[CI] Unrecognized VM option 'UseConcMarkSweepGC' in BWC tests for g1gc builds #62716

Closed
droberts195 opened this issue Sep 21, 2020 · 23 comments
Labels
:Delivery/Build Build or test infrastructure Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

Build scan:

https://gradle-enterprise.elastic.co/s/7zowbgz6besag

Repro line:

There isn't one in the failure message. I thought this might reproduce it but it doesn't:

./gradlew :qa:repository-multi-version:v7.10.0#bwcTest

Reproduces locally?:

No

Applicable branches:

master

Failure history:

Hard to search for, as the build stats record 0 test failures for this failure scenario.

Failure excerpt:

> Task :qa:full-cluster-restart:v7.10.0#oldClusterTest FAILED
Exec output and error:
| Output for ./bin/elasticsearch-keystore:Unrecognized VM option 'UseConcMarkSweepGC'
| Error: Could not create the Java Virtual Machine.
| Error: A fatal exception has occurred. Program will exit.

> Task :qa:repository-multi-version:v7.10.0#Step1OldClusterTest FAILED
Exec output and error:
| Output for ./bin/elasticsearch-keystore:Unrecognized VM option 'UseConcMarkSweepGC'
| Error: Could not create the Java Virtual Machine.
| Error: A fatal exception has occurred. Program will exit.

> Task :qa:verify-version-constants:v7.10.0#integTest FAILED
Exec output and error:
| Output for ./bin/elasticsearch-keystore:Unrecognized VM option 'UseConcMarkSweepGC'
| Error: Could not create the Java Virtual Machine.
| Error: A fatal exception has occurred. Program will exit.

I am guessing this is happening because we are now using Java 15 as the bundled JDK for 7.x, and certain tools used during CI runs always run with the bundled JDK rather than the runtime JDK reported by Gradle. Then presumably the runtime options selected for the runtime JDK reported by Gradle, e.g. UseConcMarkSweepGC, also get passed to the bundled JDK used to run some of the setup tools like elasticsearch-keystore, and Java 15 doesn't understand UseConcMarkSweepGC. However, I am not sure why it's so hard to reproduce locally if this theory is correct, so it's probably wrong 😆 .

@droberts195 droberts195 added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI labels Sep 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Build)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Sep 21, 2020
@rjernst
Member

rjernst commented Sep 22, 2020

@breskeby Can you take a look?

@orhantoy
Contributor

Something we (Enterprise Search team) noticed is that 7.9.1 to 7.9.2 upgrades could be prevented by this issue.

@rjernst
Member

rjernst commented Sep 22, 2020

@orhantoy The particular jvm option UseConcMarkSweepGC has been deprecated for a long time, and is removed as of Java 15. The bwc tests break because we have updated the bundled jdk to java 15 starting with 7.9.2. Users have had warnings about this for a long time. I see this issue as needing to adjust the jvm options for our bwc tests configured in gradle depending on the version of Elasticsearch. Upgrades aren't prevented; users need to adjust their jvm options.

@orhantoy
Contributor

@rjernst All good, I just wanted to raise it in case it was missed. @jaymode explained it is expected and not a blocker.

@breskeby
Contributor

I'll take a look at adjusting the jvm options for our bwc tests configured in gradle depending on the version of Elasticsearch.

@breskeby
Contributor

breskeby commented Sep 24, 2020

I do not see where this UseConcMarkSweepGC flag is used in any of our tests, but I noticed the CI build uses the flag in its setup. We might want to remove it since, as @rjernst pointed out, it is deprecated: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/ Where are those CI jobs defined?

@geerlingguy

I'm hitting this on Ubuntu 18.04 with:

# java --version
openjdk 11.0.8 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)

As well as on CentOS 7 with:

# java -version
openjdk version "1.8.0_262"
OpenJDK Runtime Environment (build 1.8.0_262-b10)
OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)

I run weekly build tests and they started failing yesterday, so the cause must be in a release from the past week.

@breskeby
Contributor

breskeby commented Sep 28, 2020

@geerlingguy can you share what tests you ran exactly? I'd like to reproduce this

@geerlingguy

@breskeby - It is the CI suite for my open source Ansible geerlingguy.elasticsearch role: https://github.com/geerlingguy/ansible-role-elasticsearch

The CI test uses Molecule, and if you clone that repo, install Molecule (pip3 install molecule), and run molecule test, you'll get the failure. But that role simply does the following:

  1. Add apt key https://artifacts.elastic.co/GPG-KEY-elasticsearch (or yum key on RHEL)
  2. Add apt repo (deb https://artifacts.elastic.co/packages/7.x/apt stable main) (or yum repo on RHEL)
  3. Install elasticsearch (with apt or yum/dnf, respectively)
  4. Copy templated config files into place (/etc/elasticsearch/elasticsearch.yml and /etc/elasticsearch/jvm.options)
  5. Start elasticsearch

It's at step 5 that it's failing.

@geerlingguy

geerlingguy commented Sep 28, 2020

@breskeby - Ah, I just realized the problem in my instance is I am overriding the entire jvm.options file instead of adding my heap size modifications to an options file inside jvm.options.d, which is the preferred way of doing it.

I'm going to fix that in my role (edit: here's the fix), and it looks like in the newer versions of Elasticsearch, its jvm.options file (with the version identifiers in front of each of the directives) works correctly (my role was overriding that file).
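
To illustrate why the stock jvm.options file keeps working across JDK versions, here is a minimal Python sketch of how version-prefixed directives could be filtered for a given Java major version. The function name `options_for` and the sample lines are hypothetical; only the `8-13:-XX:+UseConcMarkSweepGC` line appears in this thread, and this is not the launcher code Elasticsearch actually uses.

```python
import re

# Matches an optional version prefix such as "8:", "8-:" or "8-13:" in front
# of a JVM option. Lines without a prefix apply to every Java version.
_PREFIXED = re.compile(r"^(\d+)(-(\d+)?)?:(-.*)$")


def options_for(java_major, lines):
    """Return the JVM options from `lines` that apply to `java_major`."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _PREFIXED.match(line)
        if match is None:
            selected.append(line)  # unprefixed option: always applies
            continue
        low = int(match.group(1))
        if match.group(2) is None:
            high = low  # "8:" means exactly Java 8
        elif match.group(3) is None:
            high = None  # "8-:" means Java 8 and above
        else:
            high = int(match.group(3))  # "8-13:" means Java 8 through 13
        if java_major >= low and (high is None or java_major <= high):
            selected.append(match.group(4))
    return selected


# Illustrative input; the "14-:" line and the heap setting are assumptions.
SAMPLE = [
    "# GC configuration",
    "8-13:-XX:+UseConcMarkSweepGC",
    "14-:-XX:+UseG1GC",
    "-Xms1g",
]
```

With this sample, a Java 11 runtime would get the CMS flag and a Java 15 runtime would not, which is the behaviour lost when the whole file is replaced by a template without version prefixes.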

@breskeby
Contributor

@geerlingguy thanks for the update. I think we can close this then? //cc @rjernst ?

@rjernst
Member

rjernst commented Sep 30, 2020

@breskeby I'm not sure the change made to CI is sufficient. I don't think we pass any environmental jvm options through to tests. So it would seem we still have an edge case that needs to be found, where an ElasticsearchNode from testclusters is running elasticsearch-keystore with jdk15, but with older jvm options.

@breskeby
Contributor

Turns out the change above didn't fix the issue and instead caused #63611 because I mixed up -XX:-UseConcMarkSweepGC with -XX:+UseConcMarkSweepGC.

I'll try again over the next few days to narrow down which bwc test combination causes this issue, as I haven't found anything obvious in our Elasticsearch build setup.
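
Since the two flags differ by a single character, a small sketch may help make the distinction explicit: `-XX:+Name` enables a boolean HotSpot option and `-XX:-Name` disables it. The helper name `parse_bool_flag` is hypothetical and exists only for illustration.

```python
import re

# A boolean HotSpot option has the form -XX:+Name (enable) or -XX:-Name (disable).
_BOOL_FLAG = re.compile(r"^-XX:([+-])(\w+)$")


def parse_bool_flag(option):
    """Return (name, enabled) for a boolean -XX option, or None otherwise."""
    match = _BOOL_FLAG.match(option)
    if match is None:
        return None  # not a boolean -XX option (for example -Xmx1g)
    return match.group(2), match.group(1) == "+"
```

Note that the sign only matters on a JVM that still knows the option name. As the failures in this issue show, a JVM that no longer recognizes UseConcMarkSweepGC rejects the option even in its disabling form.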

@breskeby
Contributor

@droberts195 I think your initial analysis is correct. What seems to happen is that bwc tests indeed use the bundled jdk, while tests against latest master use the declared runtime jdk (which is currently 11 for master).

The reason you cannot reproduce this failure locally is that the failing CI job declares additional jvm arguments. By passing

-Dtests.jvm.argline="-XX:-UseConcMarkSweepGC -XX:-UseSerialGC -XX:+UseG1GC" we basically tell java to disable any configured UseConcMarkSweepGC and UseSerialGC garbage collectors and use UseG1GC instead.

My initial attempt, simply removing the culprit -XX:-UseConcMarkSweepGC, only works halfway. It does fix the problem you see here, because that now-unknown flag is no longer passed to the bundled jdk for bwc tests.
But it makes tests against latest master fail. As mentioned earlier, those actually run on java 11, which is configured to use UseConcMarkSweepGC by default (see

8-13:-XX:+UseConcMarkSweepGC
)

Once I removed -XX:-UseConcMarkSweepGC from this CI job definition, we started to fail with a duplicate GC configuration, because CMS was no longer properly disabled before G1 was added.

I reverted my initial partial fix for now and will think about the best way to fix this properly.

IMO we need to provide different jvm arguments for this particular case depending on which runtime we're using.
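
As a rough sketch of that idea (not what the CI job actually does), the argline could be built from the Java major version of the JVM that will receive it. The function name `g1_argline` and the constant `CMS_REMOVED_IN` are hypothetical; the cut-off of 15 is taken from the statement earlier in this thread that the option is removed as of Java 15.

```python
# Assumption from this thread: UseConcMarkSweepGC is unknown as of Java 15.
CMS_REMOVED_IN = 15


def g1_argline(java_major):
    """Build a test argline that forces G1 on the given Java major version."""
    args = []
    if java_major < CMS_REMOVED_IN:
        # Older runtimes still know CMS, so switch it off explicitly to
        # avoid ending up with two collectors selected.
        args.append("-XX:-UseConcMarkSweepGC")
    args.append("-XX:-UseSerialGC")
    args.append("-XX:+UseG1GC")
    return " ".join(args)
```

The open question in this issue is which version to pass in: the runtime jdk reported by gradle for master tests, or the bundled jdk of the old-version distribution for bwc tests.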

@rjernst
Member

rjernst commented Oct 19, 2020

@breskeby I think we should consider working on a long-outstanding issue, to utilize our official jvm options within gradle. This is partially described in #32257. The jvm options already have version specific options. If we can move the specification into gradle as that issue describes, great, but we could also look at other ways to utilize the existing jvm.options file and merge in any additional options specified in gradle.
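
One possible reading of "merging additional options", sketched in Python under the assumption that a later boolean -XX flag should override an earlier setting of the same flag. The function name `merge_options` is hypothetical, and this is not a description of #32257 or of any existing gradle code.

```python
import re

_BOOL_FLAG = re.compile(r"^-XX:[+-](\w+)$")


def merge_options(base, extra):
    """Merge two option lists; for boolean -XX flags the last setting wins."""
    merged = []
    index_by_flag = {}  # flag name -> position of its current setting
    for option in list(base) + list(extra):
        match = _BOOL_FLAG.match(option)
        if match and match.group(1) in index_by_flag:
            # Replace the earlier setting in place instead of adding a
            # second, conflicting one.
            merged[index_by_flag[match.group(1)]] = option
            continue
        if match:
            index_by_flag[match.group(1)] = len(merged)
        merged.append(option)
    return merged
```

For example, merging a base list that enables CMS with extra options that disable CMS and enable G1 would yield a single CMS entry (disabled) followed by the G1 flag.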

@breskeby
Contributor

Ah nice. Thanks for the pointer @rjernst, I wasn't aware of that issue.

@astefan
Contributor

astefan commented Oct 22, 2020

There were more failures yesterday and today for what I think is the same issue. I see > Process 'command './bin/elasticsearch-keystore'' finished with non-zero exit value 1 but not the actual reason for the failure, as was the case in the initial report. All failures were on master and with g1gc, though.

https://gradle-enterprise.elastic.co/s/hk3rw2usoztfs
https://gradle-enterprise.elastic.co/s/abi7l6gorg5we
https://gradle-enterprise.elastic.co/s/ke3h2g33wy6r2
https://gradle-enterprise.elastic.co/s/4rsydbxbwiupu
https://gradle-enterprise.elastic.co/s/hv56nq4bm5ez2

@davidkyle
Member

More failures with the unknown option error:

:qa:verify-version-constants:v7.10.0#integTest FAILED   
Exec output and error:  
Output for ./bin/elasticsearch-keystore:Unrecognized VM option 'UseConcMarkSweepGC'   

The recent failures have all been on the master branch running the full cluster restart tests against 7.10. The old cluster (v7.10) fails to start with the above error.

18:16:25 * What went wrong:
18:16:25 Execution failed for task ':qa:full-cluster-restart:v7.10.0#oldClusterTest'.
18:16:25 > Process 'command './bin/elasticsearch-keystore'' finished with non-zero exit value 1

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/449/console

I see that there isn't a simple fix for this. @breskeby, is it possible to disable this build until the problem is resolved, please?
I'm not the first person on test triage duty to see the build failures and end up here, and it would be a shame for others to waste cycles performing the same investigation.

@ywangd
Member

ywangd commented Oct 30, 2020

One more: https://gradle-enterprise.elastic.co/s/mvomo62cyuyyw

The error message is the same

Exec output and error:
| Output for ./bin/elasticsearch-keystore:Unrecognized VM option 'UseConcMarkSweepGC'
| Error: Could not create the Java Virtual Machine.
| Error: A fatal exception has occurred. Program will exit.

@tvernum
Contributor

tvernum commented Nov 9, 2020

Do we really need the G1GC Jenkins builds anymore?

Given that master-matrix+openjdk15 will automatically use G1GC (because JDK15 doesn't support CMS), what value do we get from having a specific G1GC CI build (that is currently failing due to this issue)?

@breskeby
Contributor

breskeby commented Nov 9, 2020

I tend to agree with @tvernum. With openjdk15 already using G1GC, this CI build doesn't give us much value.

@droberts195 droberts195 changed the title from "[CI] Unrecognized VM option 'UseConcMarkSweepGC' in some BWC tests" to "[CI] Unrecognized VM option 'UseConcMarkSweepGC' in some BWC tests for g1gc builds" Nov 11, 2020
@droberts195 droberts195 changed the title from "[CI] Unrecognized VM option 'UseConcMarkSweepGC' in some BWC tests for g1gc builds" to "[CI] Unrecognized VM option 'UseConcMarkSweepGC' in BWC tests for g1gc builds" Nov 11, 2020
@mark-vieira mark-vieira added Team:Delivery Meta label for Delivery team and removed Team:Core/Infra Meta label for core/infra team labels Nov 11, 2020
@davidkyle
Member

Still failing twice a day: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/

@mark-vieira could we possibly mute this build failure, please? I'm sure everyone who has to cover test triage would appreciate it. Thanks.
