Make All OS tests run on GCP instances #46924

Merged
47 commits merged on Oct 4, 2019

@alpar-t alpar-t commented Sep 20, 2019

This PR makes the necessary adaptations to the tests and adds a PowerShell script to
invoke the OS tests on GCP instances connected as CI workers.

I also noticed that the tests were not producing logs and were not using log4j, so I fixed that too.

One of the difficulties in working on these tests was that they just stalled with no indication of where the problem was.
To ease debugging, after Process Explorer suggested that the tests were running some commands, we now have multiple timeouts: one for the tests (which generates a thread dump) and one for individual commands (which bails out with the command being run and its output and error so far), to make it easier to see what went wrong.

The tests were blocking because the pipes to the sub-process were apparently not closing, so the threads reading them blocked and we then blocked indefinitely on the join. I'm not sure why this doesn't happen in Vagrant, but we now deal with it properly.
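
For readers following along, here is a minimal sketch of the per-command timeout idea, not the code from this PR. It assumes a hypothetical `TimedCommand` helper that redirects the child's output to temporary files instead of pipes, so no reader thread can hang on a pipe that never closes, and that fails with the command and its output so far when the timeout expires. All class, method, and file names here are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not the Shell class touched by this PR. */
final class TimedCommand {

    /** Exit code plus everything the command wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final String stdout;
        final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    static Result run(List<String> command, long timeout, TimeUnit unit) throws IOException, InterruptedException {
        // Capture output in files rather than pipes, so nothing blocks waiting for a pipe to close.
        Path out = Files.createTempFile("timed-command", ".out");
        Path err = Files.createTempFile("timed-command", ".err");
        try {
            Process process = new ProcessBuilder(command)
                .redirectOutput(out.toFile())
                .redirectError(err.toFile())
                .start();
            process.getOutputStream().close(); // the command gets no input, so close its stdin right away
            if (process.waitFor(timeout, unit) == false) {
                process.destroyForcibly();
                // Bail out with the command and whatever it produced so far.
                throw new IllegalStateException(
                    "Timed out after " + timeout + " " + unit + " running " + command
                        + "\nstdout so far:\n" + read(out)
                        + "\nstderr so far:\n" + read(err)
                );
            }
            return new Result(process.exitValue(), read(out), read(err));
        } finally {
            // Best-effort cleanup of the capture files.
            out.toFile().delete();
            err.toFile().delete();
        }
    }

    private static String read(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Assumes the current JVM's launcher lives under java.home/bin.
        String launcher = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        Result result = run(List.of(launcher, "-version"), 1, TimeUnit.MINUTES);
        System.out.println("exit code " + result.exitCode);
        System.out.println(result.stderr);
    }
}
```

The test-level timeout with a thread dump would sit in the test framework configuration and is not shown here.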

@alpar-t alpar-t added :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts v8.0.0 v7.5.0 labels Sep 20, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor Author

alpar-t commented Sep 20, 2019

Here's a build scan from my testing: https://scans.gradle.com/s/lpgxftanojdv2/tests

@rjernst it seems we are running each test multiple times, but we can address that separately

"$process.Start() | Out-Null; " +
"$process.Id;"
);
if (System.getenv("username").equals("vagrant")) {
Contributor

Most of the PowerShell lines in the following sh.run() calls are the same. Is there any value in reducing the duplication here?
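
One way the duplication could be reduced is to build the shared PowerShell statements in a single Java helper. This is only a sketch: the last two statements mirror the snippet under review, while the `ProcessStartInfo` setup lines, the `PowerShellScripts` class, and its method names are assumptions made for illustration and may not match the actual scripts in the PR.

```java
/** Hypothetical helper for illustration; the real scripts in the PR may differ. */
final class PowerShellScripts {

    /**
     * Builds a script that starts {@code fileName} with {@code arguments} and prints the process id.
     * Only the final two statements are taken from the reviewed snippet; the setup lines are assumed.
     */
    static String startAndPrintPid(String fileName, String arguments) {
        return String.join(
            " ",
            "$processInfo = New-Object System.Diagnostics.ProcessStartInfo;",
            "$processInfo.FileName = '" + escape(fileName) + "';",
            "$processInfo.Arguments = '" + escape(arguments) + "';",
            "$processInfo.UseShellExecute = $false;",
            "$process = New-Object System.Diagnostics.Process;",
            "$process.StartInfo = $processInfo;",
            "$process.Start() | Out-Null;",
            "$process.Id;"
        );
    }

    /** Inside a single-quoted PowerShell string, a single quote is escaped by doubling it. */
    static String escape(String value) {
        return value.replace("'", "''");
    }

    public static void main(String[] args) {
        System.out.println(startAndPrintPid("C:\\elasticsearch\\bin\\elasticsearch.bat", "-p pid.txt"));
    }
}
```

Each `sh.run()` call site would then pass only what differs, such as the executable and its arguments.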

Contributor

@pugnascotia pugnascotia left a comment

I can't say anything about the PowerShell, but the rest looks good.

Contributor

@mark-vieira mark-vieira left a comment

Going to defer to @rjernst for the most part here since he's most familiar with the packaging test infrastructure.

.ci/os.ps1 Outdated
Remove-Item -Recurse -Force $gradleInit -ErrorAction Ignore
New-Item -ItemType directory -Path $gradleInit

# Copy-Item .ci/init.gradle -Destination $gradleInit
Contributor

Should this be commented out?

Contributor Author

Thanks, that's left over from testing, I will un-comment it.

@mark-vieira
Contributor

> This PR makes the necessary adaptations to the tests and adds a PowerShell script to
> invoke the OS tests on GCP instances connected as CI workers.

Just to confirm, the build isn't actually spinning up workers. What we intend to do is launch a Windows CI worker, then directly invoke the destructive test task on that worker instead of inside a Vagrant VM. Yes?

@alpar-t
Contributor Author

alpar-t commented Sep 24, 2019

> Just to confirm, the build isn't actually spinning up workers. What we intend to do is launch a Windows CI worker, then directly invoke the destructive test task on that worker instead of inside a Vagrant VM. Yes?

That's right. We will have a matrix job so CI will spin up the workers and invoke the ps1 script from this PR on each.

@alpar-t alpar-t changed the title Make Windows OS tests run on GCP instances Make All OS tests run on GCP instances Sep 27, 2019
@alpar-t
Contributor Author

alpar-t commented Sep 30, 2019

Windows tests on CI workers passed: https://gradle-enterprise.elastic.co/s/p4qyrkswxipm4

@alpar-t
Contributor Author

alpar-t commented Sep 30, 2019

Debian also passing with the last fix: https://gradle-enterprise.elastic.co/s/xrc6baytgnqqg

@alpar-t
Contributor Author

alpar-t commented Sep 30, 2019

@elasticmachine run elasticsearch-ci/2

Member

@rjernst rjernst left a comment

I left some comments


# CI configures these, uncomment if running manually
#
# $env:ES_BUILD_JAVA="java12"
Member

Can we just set these if they are not set so manual editing is not necessary?

Contributor Author

I was planning to read .ci/java-versions.properties as we do on Linux, but in a follow-up.

$ErrorActionPreference="Continue"
# TODO: remove the task exclusions once dependencies are set correctly and these don't run for Windows, or building the deb on Windows is fixed
& .\gradlew.bat -g "C:\Users\$env:username\.gradle" --parallel --scan --console=plain destructiveDistroTest `
-x :distribution:packages:buildOssDeb `
Member

What is the task dep causing these to be built? We should never even try building these if they won't be used by the test, which was a major goal of the refactoring work here.

Contributor Author

I'm not sure, but it has to be something related to the OS tests.
I assumed the refactoring was not yet complete.
We built these before because we checked that they could be extracted, but I already addressed that.

Either way I would prefer to look at this in a separate PR.

}
// we don't require java be installed for the tests, but remove it if it's there
// since we don't require it for the tests, don't bother restoring it
if (Files.exists(Paths.get("/usr/bin/java"))) {
Member

Why was the previous code causing problems? If we are going to ensure /usr/bin/java doesn't exist, we should do so outside of test execution. This seems like an environmental issue, not something a test (in the middle of other tests running) should be changing. Deleting and not restoring system files is a serious, unexpected side effect to a test running.

Contributor Author

Vagrant VMs have a java package installed whereas CI workers do not.
Since we label these tests as destructive and always run them on ephemeral VMs, restoring it made the test unnecessarily hard to read. I think restoring is worthwhile when it makes it possible to reuse a VM, e.g. for development, but in this case the test also works when no java is installed.

The right fix here is probably to make sure java is not installed on the Vagrant images?

Contributor Author

I added the restore back to make this less awkward.
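
A restore of this kind can be sketched as a move-aside wrapped in try/finally. The helper below is a hypothetical illustration, not the code added in this PR: the `MoveAside` class, its method name, and the `.bak` suffix are all assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; names are not taken from the PR. */
final class MoveAside {

    /** Runs {@code action} with {@code file} temporarily moved out of the way, then puts it back. */
    static void withFileMovedAside(Path file, Runnable action) throws IOException {
        Path backup = file.resolveSibling(file.getFileName() + ".bak");
        boolean moved = Files.exists(file);
        if (moved) {
            Files.move(file, backup);
        }
        try {
            action.run();
        } finally {
            if (moved) {
                Files.move(backup, file); // restore even when the action throws
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses a throwaway directory instead of touching /usr/bin/java.
        Path dir = Files.createTempDirectory("move-aside");
        Path file = Files.createFile(dir.resolve("java"));
        withFileMovedAside(file, () -> System.out.println("present during action: " + Files.exists(file)));
        System.out.println("present afterwards: " + Files.exists(file));
    }
}
```

If the file does not exist in the first place, as on the CI workers described above, the action simply runs and nothing is restored.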

return;
}
logger.info("Showing contents of directory: {} ({})", logsDir, logsDir.toAbsolutePath());
try(Stream<Path> fileStream = Files.list(logsDir)) {
Member

nit: space after try

public void chown(Path path) throws Exception {
Platforms.onLinux(() -> run("chown -R elasticsearch:elasticsearch " + path));
Platforms.onWindows(() -> run(
"$account = New-Object System.Security.Principal.NTAccount '" + System.getenv("username") + "'; " +
Member

Don't we have the constant ARCHIVE_USER for use here and above?

Contributor Author

There's ARCHIVE_OWNER, but that's part of ArchiveTests and has hard-coded values.

try {
Path tmpDir = Paths.get(System.getProperty("java.io.tmpdir"));
Files.createDirectories(tmpDir);
stdOut = Files.createTempFile(tmpDir, getClass().getName(), ".out");
Member

Where are these files cleaned up?

Contributor Author

Good catch, they were not. I added cleanup.

}
LinkedList<String> result = new LinkedList<>();
AtomicBoolean linesDiscarded = new AtomicBoolean(false);
try(Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
Member

nit: space after try

try(Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
lines.forEach(line -> {
result.add(line);
if (result.size() >= TAIL_WHEN_TOO_MUCH_OUTPUT) {
Member

Why are we omitting lines at all? Once the VM is destroyed we can't get them out. If we want to debate the merits of what is output, fine, but I think that should be a separate PR from this already large PR to get the OS tests running in GCP.

Contributor Author

I'm going to stop omitting lines, but will keep the size check.
One of the Windows installer tests generates a 16GB stderr.
I don't know what's in it because I didn't download it and nothing on Windows could open it...
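
A size check of this kind might look roughly like the sketch below. It is not the PR's implementation: the `BoundedOutput` class, the 100 MB default, and the placeholder message are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; the limit and message are assumed. */
final class BoundedOutput {

    /** Assumed default limit of 100 MB, not a value taken from the PR. */
    static final long DEFAULT_MAX_BYTES = 100L * 1024 * 1024;

    /** Returns the whole file, or a short placeholder when the file is larger than {@code maxBytes}. */
    static String read(Path path, long maxBytes) throws IOException {
        long size = Files.size(path);
        if (size > maxBytes) {
            // Do not load huge output into memory; point at the file instead.
            return "<output omitted: " + size + " bytes exceeds the limit of " + maxBytes + " bytes, see " + path + ">";
        }
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bounded-output", ".err");
        Files.write(file, "some error output\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(file, DEFAULT_MAX_BYTES));
        System.out.println(read(file, 4));
        Files.delete(file);
    }
}
```

Checking the size before reading keeps every line of normal output while still protecting the test JVM from a multi-gigabyte file.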

Member

@rjernst rjernst left a comment

LGTM, given the follow-up work you noted in response to some previous comments.

@alpar-t alpar-t merged commit f962d1c into elastic:master Oct 4, 2019
@alpar-t alpar-t deleted the windows-packaging-gcp branch October 4, 2019 05:41
alpar-t added a commit that referenced this pull request Oct 4, 2019
@jimczi jimczi added the >test Issues or PRs that are addressing/adding tests label Nov 12, 2019
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020
Labels
:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts Team:Delivery Meta label for Delivery team >test Issues or PRs that are addressing/adding tests v7.5.0 v8.0.0-alpha1