Run system tests in parallel #1909
Conversation
partial, err := r.runTest(ctx, testConfig, svcInfo)

tdErr := r.tearDownTest(ctx)
if err != nil {
-	return partial, err
+	return partial, errors.Join(err, fmt.Errorf("failed to tear down runner: %w", tdErr))
I'm not yet able to show the teardown error when the context is cancelled (pressing Ctrl+C).
As it was previously, the handler deleting test policies was failing but the error was not raised:
^C2024/06/13 19:32:38 DEBUG deleting test policies...
2024/06/13 19:32:38 INFO Signal caught!
2024/06/13 19:32:38 DEBUG POST https://127.0.0.1:5601/api/fleet/agent_policies/delete
2024/06/13 19:32:39 DEBUG POST https://127.0.0.1:5601/api/fleet/agent_policies/delete
2024/06/13 19:32:39 DEBUG Failed deleting test policy handler...: error cleaning up test policy: could not delete policy; API status code = 400; response body = {"statusCode":400,"error":"Bad Request","message":"Cannot delete an agent policy that is assigned to any active or inactive agents"}
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 DEBUG Uninstalling package...
2024/06/13 19:32:39 DEBUG GET https://127.0.0.1:5601/api/fleet/epm/packages/nginx
2024/06/13 19:32:40 DEBUG DELETE https://127.0.0.1:5601/api/fleet/epm/packages/nginx/999.999.999
interrupted
Command exited with non-zero status 130
Errors caused by interruptions are intentionally not shown; they tend to be long chains of wrapped errors that are not very useful.
We capture interruptions here:
Lines 27 to 30 in ebb1446
if errIsInterruption(err) {
	rootCmd.Println("interrupted")
	os.Exit(130)
}
If both the test and the tear down fail, I think we are fine with returning only the test error.
Ok, understood. I'll remove this errors.Join then.
Thanks!
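For illustration, here is a minimal, self-contained sketch of the error precedence agreed on above: the test error takes priority, and the teardown error is only surfaced when the test itself passed. The function names are hypothetical, not elastic-package's actual API.

package main

import (
	"errors"
	"fmt"
)

// runThenTearDown runs a test, always runs teardown, and returns the test
// error if there is one; otherwise it returns the (wrapped) teardown error.
func runThenTearDown(run func() error, tearDown func() error) error {
	runErr := run()
	tdErr := tearDown()
	if runErr != nil {
		return runErr // test error takes precedence over the teardown error
	}
	if tdErr != nil {
		return fmt.Errorf("failed to tear down runner: %w", tdErr)
	}
	return nil
}

func main() {
	err := runThenTearDown(
		func() error { return errors.New("test failed") },
		func() error { return errors.New("teardown failed") },
	)
	fmt.Println(err) // prints only the test error
}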
r.dataStreamManifest, err = packages.ReadDataStreamManifest(filepath.Join(r.dataStreamPath, packages.DataStreamManifestFile))
if err != nil {
	return nil, fmt.Errorf("reading data stream manifest failed: %w", err)
}

// Temporarily until independent Elastic Agents are enabled by default,
// enable independent Elastic Agents if package defines that requires root privileges
if pkg, ds := r.pkgManifest, r.dataStreamManifest; pkg.Agent.Privileges.Root || (ds != nil && ds.Agent.Privileges.Root) {
	r.runIndependentElasticAgent = true
}

// If the environment variable is present, it always has preference over the root
// privileges value (if any) defined in the manifest file
v, ok := os.LookupEnv(enableIndependentAgents)
if ok {
	r.runIndependentElasticAgent = strings.ToLower(v) == "true"
}
Added to the constructor so that it can be queried whether or not the runner uses independent Elastic Agents.
if err != nil {
	return nil, err
}

scenario.agent = agentDeployed

service, svcInfo, err := r.setupService(ctx, config, serviceOptions, svcInfo, agentInfo, agentDeployed, policy, serviceStateData)
Delayed setting up the service until after the handlers related to the Agent are set:
- shutdownAgentHandler (this is set within the setupAgent method)
- removeAgentHandler
- resetAgentPolicyHandler
- resetAgentLogLevelHandler
This change cannot be made. In the servicedeployer package, some agents are created there (custom agents and K8s agents), and therefore checkEnrolledAgents must be called after setupService.
At least while support is kept for those scenarios using the Elastic Agent from the stack.
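For illustration, a minimal, self-contained sketch of the ordering constraint discussed here: the agent-related handlers are registered right after agent setup (so they run even if the service setup fails), while the enrolled-agent check must happen after the service is up because some deployers start their own agents. Step and handler names follow the thread; everything else is hypothetical.

package main

import (
	"errors"
	"fmt"
)

func setupService() error { return errors.New("docker compose failed") } // stand-in failure

func main() {
	var cleanups []func()
	defer func() { // handlers run in reverse order on exit
		for i := len(cleanups) - 1; i >= 0; i-- {
			cleanups[i]()
		}
	}()

	// Step 1: agent setup and agent-related handlers.
	fmt.Println("setup agent")
	cleanups = append(cleanups, func() { fmt.Println("run shutdownAgentHandler") })
	cleanups = append(cleanups, func() { fmt.Println("run removeAgentHandler") })

	// Step 2: service setup; even if this fails, the agent handlers above still run.
	if err := setupService(); err != nil {
		fmt.Println("service setup failed:", err)
		return
	}
	cleanups = append(cleanups, func() { fmt.Println("run shutdownServiceHandler") })

	// Step 3: the enrolled-agent check happens only after service setup, because
	// custom or K8s service deployers may have started additional agents.
	fmt.Println("check enrolled agents")
}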
@@ -149,7 +149,7 @@ func (tsd TerraformServiceDeployer) SetUp(ctx context.Context, svcInfo ServiceIn
 	}
 	err = p.Up(ctx, opts)
 	if err != nil {
-		return nil, fmt.Errorf("could not boot up service using Docker Compose: %w", err)
+		return &service, fmt.Errorf("could not boot up service using Docker Compose: %w", err)
Maybe these changes could be done in a different PR.
I was thinking it would be interesting to also return the service here in case of error, to be able to run the shutdownServiceHandler.
The same applies if there is any error while waiting for the container to become healthy.
WDYT?
Removed these changes for now.
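For context, a minimal sketch of the idea that was being considered (and later removed from the PR): returning the partially created service together with the error, so the caller can still clean it up. All names here are illustrative, not the real servicedeployer types.

package main

import (
	"errors"
	"fmt"
)

type service struct{ name string }

func (s *service) Shutdown() { fmt.Println("shutting down", s.name) }

func bootUp() error { return errors.New("compose failed") } // stand-in failure

// setUp returns the service even on error, so a shutdown handler can run.
func setUp() (*service, error) {
	s := &service{name: "terraform-service"}
	if err := bootUp(); err != nil {
		return s, fmt.Errorf("could not boot up service: %w", err)
	}
	return s, nil
}

func main() {
	svc, err := setUp()
	if err != nil {
		fmt.Println(err)
		if svc != nil {
			svc.Shutdown() // cleanup is still possible despite the error
		}
	}
}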
logger.Debugf("Container %s status: %s (no health status)", containerDescription.ID, containerDescription.State.Status) | ||
continue | ||
} | ||
|
||
// Service is up and running and it's healthy | ||
if containerDescription.State.Status == "running" && containerDescription.State.Health.Status == "healthy" { | ||
logger.Debugf("Container %s status: %s (health: %s)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.Health.Status) | ||
continue | ||
} | ||
|
||
// Container started and finished with exit code 0 | ||
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode == 0 { | ||
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode) | ||
continue | ||
} | ||
|
||
// Container exited with code > 0 | ||
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode > 0 { | ||
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode) | ||
return fmt.Errorf("container (ID: %s) exited with code %d", containerDescription.ID, containerDescription.State.ExitCode) | ||
} | ||
|
||
// Any different status is considered unhealthy | ||
logger.Debugf("Container %s status: unhealthy", containerDescription.ID) |
Examples of logs with these changes:
2024/06/13 19:32:35 DEBUG Wait for healthy containers: 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:35 DEBUG output command: /usr/bin/docker inspect 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:35 DEBUG Container 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c status: unhealthy
2024/06/13 19:32:36 DEBUG Wait for healthy containers: 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:36 DEBUG output command: /usr/bin/docker inspect 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:36 DEBUG Container 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c status: running (health: healthy)
Previously:
2024/06/13 15:43:06 DEBUG Wait for healthy containers: 2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4
2024/06/13 15:43:06 DEBUG output command: /usr/bin/docker inspect 2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4
2024/06/13 15:43:06 DEBUG Container status: {"Config":{"Image":"elastic-package-service-87356-nginx","Labels":{"com.docker.compose.config-hash":"07b2d6403034e11414161c6489c787924e32ab36e7d12c69b977e184cf40e3c3","com.docker.compose.container-number":"1","com.docker.compose.depends_on":"","com.docker.compose.image":"sha256:106ba62762b92ccdde0edf49e09063ee28a3be98e9342dfcd3980314b0e4c192","com.docker.compose.oneoff":"False","com.docker.compose.project":"elastic-package-service-87356","com.docker.compose.project.config_files":"/opt/buildkite-agent/builds/bk-agent-prod-gcp-1718292959927056914/elastic/elastic-package/test/packages/parallel/nginx/_dev/deploy/docker/docker-compose.yml","com.docker.compose.project.working_dir":"/opt/buildkite-agent/builds/bk-agent-prod-gcp-1718292959927056914/elastic/elastic-package/test/packages/parallel/nginx/_dev/deploy/docker","com.docker.compose.service":"nginx","com.docker.compose.version":"2.24.1","maintainer":"NGINX Docker Maintainers \[email protected]\u003e"}},"ID":"2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4","State":{"Status":"running","ExitCode":0,"Health":{"Status":"healthy","Log":[{"Start":"2024-06-13T15:43:05.983549466Z","ExitCode":0,"Output":" % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 97 100 97 0 0 97000 0 --:--:-- --:--:-- --:--:-- 97000\nActive connections: 1 \nserver accepts handled requests\n 1 1 1 \nReading: 0 Writing: 1 Waiting: 0 \n"}]}}}
Maybe these logs are too verbose? But we can leave them and update later when/if we add the trace debug level.
Do you mean to keep the previous log messages?
I mean to leave it as is now in the PR, and reduce later when we have more log levels.
@@ -118,7 +118,8 @@ func (c *Client) waitUntilPolicyAssigned(ctx context.Context, a Agent, p Policy)
 	if err != nil {
 		return fmt.Errorf("can't get the agent: %w", err)
 	}
-	logger.Debugf("Agent data: %s", agent.String())
+	logger.Debugf("Agent %s (Host: %s): Policy ID %s LogLevel: %s Status: %s",
These logs could probably be improved further (removing some field?).
Example of new logs:
2024/06/13 19:33:21 DEBUG Wait until the policy (ID: 1355408f-6518-4bbd-8b39-cded15bba61d, revision: 2) is assigned to the agent (ID: ca20edcb-a929-434d-b226-9138be510e80)...
2024/06/13 19:33:23 DEBUG GET https://127.0.0.1:5601/api/fleet/agents/ca20edcb-a929-434d-b226-9138be510e80
2024/06/13 19:33:23 DEBUG Agent ca20edcb-a929-434d-b226-9138be510e80 (Host: elastic-agent-12113): Policy ID 1355408f-6518-4bbd-8b39-cded15bba61d LogLevel: debug Status: updating
Previously:
2024/06/13 15:43:12 DEBUG Wait until the policy (ID: c9f48c54-c30e-4bb7-a17a-a66f6c1776b0, revision: 2) is assigned to the agent (ID: 8bf156ea-30b2-4714-8e99-d09059f0fe54)...
2024/06/13 15:43:14 DEBUG GET https://127.0.0.1:5601/api/fleet/agents/8bf156ea-30b2-4714-8e99-d09059f0fe54
2024/06/13 15:43:14 DEBUG Agent data: {"id":"8bf156ea-30b2-4714-8e99-d09059f0fe54","policy_id":"c9f48c54-c30e-4bb7-a17a-a66f6c1776b0","local_metadata":{"host":{"name":"docker-fleet-agent"},"elastic":{"agent":{"log_level":"info"}}},"status":"updating"}
if err != nil {
	return nil, svcInfo, fmt.Errorf("could not setup service: %w", err)
}
Should we move this error check, to allow the handler to be set?
@@ -1,6 +1,7 @@
 env:
   SETUP_GVM_VERSION: 'v0.5.2' # https://github.com/andrewkroh/gvm/issues/44#issuecomment-1013231151
   ELASTIC_PACKAGE_COMPOSE_DISABLE_VERBOSE_OUTPUT: "true"
+  ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS: 3
If a package has enabled system tests in parallel, those tests will be triggered using at most 3 goroutines.
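For illustration, one common way to enforce such a cap in Go is a semaphore built from a buffered channel. This is only a sketch of the pattern, not elastic-package's actual scheduler; the test names are made up.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const maxParallel = 3 // e.g. the value of ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS
	tests := []string{"test-a", "test-b", "test-c", "test-d", "test-e"}

	sem := make(chan struct{}, maxParallel) // semaphore with 3 slots
	var wg sync.WaitGroup
	for _, name := range tests {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks when 3 tests are running
			defer func() { <-sem }() // release the slot when done
			fmt.Println("running", name)
			time.Sleep(100 * time.Millisecond) // stand-in for the actual system test
		}(name)
	}
	wg.Wait()
}

An errgroup.Group with SetLimit(3) from golang.org/x/sync would achieve the same cap while also propagating errors; the channel version is shown here only because it needs no extra dependency.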
go.mod (Outdated)

replace github.com/elastic/package-spec/v3 => github.com/mrodm/package-spec/v3 v3.0.0-20240613145150-8d065e83d217
Requires a new version of package-spec
Looks great, also tested locally and works as expected.
Commenting on some of the open questions.
logger.Debugf("Container %s status: %s (no health status)", containerDescription.ID, containerDescription.State.Status) | ||
continue | ||
} | ||
|
||
// Service is up and running and it's healthy | ||
if containerDescription.State.Status == "running" && containerDescription.State.Health.Status == "healthy" { | ||
logger.Debugf("Container %s status: %s (health: %s)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.Health.Status) | ||
continue | ||
} | ||
|
||
// Container started and finished with exit code 0 | ||
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode == 0 { | ||
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode) | ||
continue | ||
} | ||
|
||
// Container exited with code > 0 | ||
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode > 0 { | ||
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode) | ||
return fmt.Errorf("container (ID: %s) exited with code %d", containerDescription.ID, containerDescription.State.ExitCode) | ||
} | ||
|
||
// Any different status is considered unhealthy | ||
logger.Debugf("Container %s status: unhealthy", containerDescription.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these logs are too verbose? But we can leave them and update later when/if we add the trace debug level.
|
||
partial, err := r.runTest(ctx, testConfig, svcInfo) | ||
|
||
tdErr := r.tearDownTest(ctx) | ||
if err != nil { | ||
return partial, err | ||
return partial, errors.Join(err, fmt.Errorf("failed to tear down runner: %w", tdErr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errors caused by interruptions are not shown by intention, they use to be long chains of wrapped errors not very useful.
We capture interruptions here:
Lines 27 to 30 in ebb1446
if errIsInterruption(err) { | |
rootCmd.Println("interrupted") | |
os.Exit(130) | |
} |
If both the test and the tear down fail, I think we are fine with returning only the test error.
)

type testConfig struct {
	testrunner.SkippableConfig `config:",inline"`
Added the option to skip individual policy tests.
Great 👍
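For illustration, a self-contained sketch of how an inlined skippable config like the one above can be consumed. The type and field names below are assumptions made for the example, not necessarily the real testrunner types.

package main

import "fmt"

// Hypothetical shapes standing in for the testrunner skip types.
type SkipConfig struct {
	Reason string
	Link   string
}

type SkippableConfig struct {
	Skip *SkipConfig `config:"skip"`
}

type testConfig struct {
	SkippableConfig `config:",inline"`
	Name            string
}

func main() {
	cfg := testConfig{
		Name: "policy test",
		SkippableConfig: SkippableConfig{
			Skip: &SkipConfig{Reason: "flaky upstream", Link: "https://example.com/issue"},
		},
	}
	// Because SkippableConfig is embedded, the skip settings are promoted
	// and can be checked directly on the test config.
	if cfg.Skip != nil {
		fmt.Printf("skipping %q: %s (%s)\n", cfg.Name, cfg.Skip.Reason, cfg.Skip.Link)
	}
}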
💛 Build succeeded, but was flaky
Failed CI Steps
History
cc @mrodm
Depends on elastic/package-spec#759
Part of #787
Run system tests in parallel:
- Updated the nginx and apache packages to run system tests in parallel.
- Reordered when the shutdownAgentHandler is set: just after setting up the Agent, keeping support for agents created in the servicedeployer package (custom agents or k8s agents).
- Changed how the shutdown service handler is set, even if there is an error. Reverted.

Checklist

How to test