-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make All OS tests run on GCP instances #46924
Conversation
This PR makes the necesary adaptations to the tests and adds a power shell script to invoke the OS tests on GCP instances connected as CI workers.
Pinging @elastic/es-core-infra |
Here's a build scan from my testing: https://scans.gradle.com/s/lpgxftanojdv2/tests @rjernst it seems we are running each test multiple times, but we can address that separately |
"$process.Start() | Out-Null; " + | ||
"$process.Id;" | ||
); | ||
if (System.getenv("username").equals("vagrant")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the PowerShell lines in the following sh.run()
calls are the same. Is there any value in reducing the duplication here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say anything about the PowerShell, but the rest looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to defer to @rjernst for the most part here since he's most familiar with the packaging test infrastructure.
.ci/os.ps1
Outdated
Remove-Item -Recurse -Force $gradleInit -ErrorAction Ignore | ||
New-Item -ItemType directory -Path $gradleInit | ||
|
||
# Copy-Item .ci/init.gradle -Destination $gradleInit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's left over from testing, I will un-comment it.
Just to confirm, the build isn't actually spinning up workers. What we intend to do is launch a Windows CI work, then directly invoke the |
That's right. We will have a matrix job so CI will spin up the workers and invoke the |
Windows test on CI workers passed : https://gradle-enterprise.elastic.co/s/p4qyrkswxipm4 |
CentOS passed: https://gradle-enterprise.elastic.co/s/dai35e6bvhk2m |
qa/os/src/test/java/org/elasticsearch/packaging/test/ArchiveTests.java
Outdated
Show resolved
Hide resolved
Debian also passing with the last fix: https://gradle-enterprise.elastic.co/s/xrc6baytgnqqg |
@elasticmachine run elasticsearch-ci/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments
|
||
# CI configures these, uncoment if running manually | ||
# | ||
# $env:ES_BUILD_JAVA="java12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just set these if they are not set so manual editing is not necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to read .ci/java-versions.properties
like in Linux, but in a follow up.
$ErrorActionPreference="Continue" | ||
# TODO: remove the task exclusions once dependencies are set correctly and these don't run for Windows or buldiung the deb on windows is fixed | ||
& .\gradlew.bat -g "C:\Users\$env:username\.gradle" --parallel --scan --console=plain destructiveDistroTest ` | ||
-x :distribution:packages:buildOssDeb ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the task dep causing these to be built? We should never even try building these if they won't be used by the test, which was a major goal of the refactoring work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, but has to be something relating to the OS tests.
I assumed the refactoring is not yet complete.
We built these before because we check that these can be extracted but I already addressed that.
Either way I would prefer to look at this in a separate PR.
} | ||
// we don't require java be installed for the tests, but remove it if it's there | ||
// since we don't require it for the tests, don't bother restoring it | ||
if (Files.exists(Paths.get("/usr/bin/java"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was the previous code causing problems? If we are going to ensure /usr/bin/java doesn't exist, we should do so outside of test execution. This seems like an environmental issue, not something a test (in the middle of other tests running) should be changing. Deleting and not restoring system files is a serious, unexpected side effect to a test running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vagrant VMs have a java package installed whereas CI workers do not.
Since we label these as destructive and always run them on ephemeral VMs, the restoring of this made the test unnecessarily hard to read. I think it's worth restoring when it makes it possible to re-use a VM e.x. for development, but in this case, the test works if there's no java installed as well.
The right fix here is probably to make sure java is not installed on the vagrant images ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the restore back to make this less awkward.
return; | ||
} | ||
logger.info("Showing contents of directory: {} ({})", logsDir, logsDir.toAbsolutePath()); | ||
try(Stream<Path> fileStream = Files.list(logsDir)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after try
public void chown(Path path) throws Exception { | ||
Platforms.onLinux(() -> run("chown -R elasticsearch:elasticsearch " + path)); | ||
Platforms.onWindows(() -> run( | ||
"$account = New-Object System.Security.Principal.NTAccount '" + System.getenv("username") + "'; " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have the constant ARCHIVE_USER for use here and above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's ARCHIVE_OWNER
but that's part of the ArchiveTests
and has hard coded values.
try { | ||
Path tmpDir = Paths.get(System.getProperty("java.io.tmpdir")); | ||
Files.createDirectories(tmpDir); | ||
stdOut = Files.createTempFile(tmpDir, getClass().getName(), ".out"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are these files cleaned up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, they were not. I added cleanup.
} | ||
LinkedList<String> result = new LinkedList<>(); | ||
AtomicBoolean linesDiscarded = new AtomicBoolean(false); | ||
try(Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after try
try(Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) { | ||
lines.forEach(line -> { | ||
result.add(line); | ||
if (result.size() >= TAIL_WHEN_TOO_MUCH_OUTPUT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we omitting lines at all? Once the VM is destroyed we can't get them out. If we want to debate about the merits of what is output fine, but I think it should be a separate PR from this already large PR to get the OS tests running in gcp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to remove omitting the lines, but will keep the size check.
One of the windows installer tests is generating a 16GB stderr.
I don't know what's in it because I didn't download it and nothing on windows could open it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, given the followup work you noted in response to some previous comments.
This PR makes the necesary adaptations to the tests and adds a power shell script to invoke the OS tests on GCP instances connected as CI workers. Also noticed that logs were not being produced by the tests and that theses were not using log4j so fixed that too. One of the difficulties in working on theses tests was that the tests just stalled with no indication where the problem is. To ease with the debugging, after process explorer suggested that the tests are running some commands, we now have multiple timeouts: one for the tests ( which will generate a thread dump ) and one for individual commands ( that bails with the command being ran and output and error so far ) to make it easier to see what went wrong. The tests were blocking because apparently the pipes to the sub-process were not closing, thus the threads were blocking on them and we were blocking indefinitely on the join. I'm not sure why this doesn't happen in vagrant, but we now properly deal with it.
This PR makes the necesary adaptations to the tests and adds a power shell script to
invoke the OS tests on GCP instances connected as CI workers.
Also noticed that logs were not being produced by the tests and that theses were not using log4j so fixed that too.
One of the difficulties in working on theses tests was that the tests just stalled with no indication where the problem is.
To ease with the debugging, after process explorer suggested that the tests are running some commands, we now have multiple timeouts: one for the tests ( which will generate a thread dump ) and one for individual commands ( that bails with the command being ran and output and error so far ) to make it easier to see what went wrong.
The tests were blocking because apparently the pipes to the sub-process were not closing, thus the threads were blocking on them and we were blocking indefinitely on the join. I'm not sure why this doesn't happen in vagrant, but we now properly deal with it.