Run system tests in parallel #1909

Merged

mrodm merged 46 commits into elastic:main from system-tests-parallel on Jun 19, 2024
Conversation

mrodm (Contributor, author) commented Jun 13, 2024

Depends on elastic/package-spec#759
Part of #787

Run system tests in parallel:

  • Created new methods to trigger tests in parallel (using goroutines); see the sketch after this list.
  • Added new global test configuration files to define whether or not a package's tests can run in parallel, or whether all tests of a given test type should be skipped.
    • By default, tests run sequentially.
  • Added environment variables to enable independent Elastic Agents and to define how many goroutines can run in parallel.
    • Updated the nginx and apache packages to run system tests in parallel.
  • Reordered when the shutdownAgentHandler is set, so that it is set just after setting up the Agent.
    • Required to keep the same order while agents can be created in the servicedeployer package (custom agents or k8s agents).
  • Changed how the shutdown service handler is set, so that it is set even if there is an error. (Reverted)
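
A minimal sketch of the bounded-parallelism pattern referenced in the first bullet (illustrative only; maxParallel, runTestCase, and the test names are made up and do not match the actual runner code):

package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"sync"
)

// maxParallel reads the maximum number of parallel test routines from the
// environment, defaulting to 1 (sequential) when unset or invalid.
func maxParallel() int {
	if v, err := strconv.Atoi(os.Getenv("ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS")); err == nil && v > 0 {
		return v
	}
	return 1
}

// runTestCase is a hypothetical stand-in for running a single system test.
func runTestCase(ctx context.Context, name string) error {
	fmt.Println("running", name)
	return nil
}

func main() {
	ctx := context.Background()
	tests := []string{"access", "error", "stubstatus"}

	sem := make(chan struct{}, maxParallel()) // bounds the number of concurrent tests
	var wg sync.WaitGroup
	for _, name := range tests {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release the slot
			if err := runTestCase(ctx, name); err != nil {
				fmt.Println("test failed:", name, err)
			}
		}(name)
	}
	wg.Wait()
}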

Checklist

How to test

  1. Add a global test configuration file to a package:

     system:
       parallel: true

  2. Run the system tests:
elastic-package stack up -v -d

cd test/packages/parallel/nginx/

ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS=3 ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=true elastic-package test system -v

cd test/packages/parallel/apache/

ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS=3 ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=true elastic-package test system -v


elastic-package stack down -v

mrodm self-assigned this Jun 13, 2024

partial, err := r.runTest(ctx, testConfig, svcInfo)

tdErr := r.tearDownTest(ctx)
if err != nil {
- return partial, err
+ return partial, errors.Join(err, fmt.Errorf("failed to tear down runner: %w", tdErr))
mrodm (Contributor, author):

I'm not yet able to show the teardown error when the context is cancelled (pressing Ctrl+C).

As it was previously, the handler deleting test policies was failing, but the error was not raised:

^C2024/06/13 19:32:38 DEBUG deleting test policies...
2024/06/13 19:32:38  INFO Signal caught!
2024/06/13 19:32:38 DEBUG POST https://127.0.0.1:5601/api/fleet/agent_policies/delete
2024/06/13 19:32:39 DEBUG POST https://127.0.0.1:5601/api/fleet/agent_policies/delete
2024/06/13 19:32:39 DEBUG Failed deleting test policy handler...: error cleaning up test policy: could not delete policy; API status code = 400; response body = {"statusCode":400,"error":"Bad Request","message":"Cannot delete an agent policy that is assigned to any active or inactive agents"}
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 ERROR context error: context canceled
2024/06/13 19:32:39 DEBUG Uninstalling package...
2024/06/13 19:32:39 DEBUG GET https://127.0.0.1:5601/api/fleet/epm/packages/nginx
2024/06/13 19:32:40 DEBUG DELETE https://127.0.0.1:5601/api/fleet/epm/packages/nginx/999.999.999
interrupted
Command exited with non-zero status 130

jsoriano (Member):

Errors caused by interruptions are intentionally not shown; they tend to be long chains of wrapped errors that are not very useful.

We capture interruptions here:

elastic-package/main.go

Lines 27 to 30 in ebb1446

if errIsInterruption(err) {
rootCmd.Println("interrupted")
os.Exit(130)
}

If both the test and the tear down fail, I think we are fine with returning only the test error.

mrodm (Contributor, author):

Ok, understood. I'll remove this errors.Join then.

Thanks!
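
For reference, a minimal sketch of how joined errors interact with an errors.Is-based interruption check — assuming errIsInterruption looks for context.Canceled, which is an assumption and not necessarily how it is actually implemented:

package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	// Hypothetical errors standing in for a cancelled test run and a failed teardown.
	testErr := fmt.Errorf("running test failed: %w", context.Canceled)
	tdErr := errors.New("failed to tear down runner: policy still assigned")

	joined := errors.Join(testErr, tdErr)

	// errors.Is traverses joined error trees (Go 1.20+), so the cancellation
	// is still detectable even when the teardown error is joined in.
	fmt.Println(errors.Is(joined, context.Canceled)) // true
}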

Comment on lines +250 to +266
r.dataStreamManifest, err = packages.ReadDataStreamManifest(filepath.Join(r.dataStreamPath, packages.DataStreamManifestFile))
if err != nil {
return nil, fmt.Errorf("reading data stream manifest failed: %w", err)
}

// Temporarily until independent Elastic Agents are enabled by default,
// enable independent Elastic Agents if package defines that requires root privileges
if pkg, ds := r.pkgManifest, r.dataStreamManifest; pkg.Agent.Privileges.Root || (ds != nil && ds.Agent.Privileges.Root) {
r.runIndependentElasticAgent = true
}

// If the environment variable is present, it always has preference over the root
// privileges value (if any) defined in the manifest file
v, ok := os.LookupEnv(enableIndependentAgents)
if ok {
r.runIndependentElasticAgent = strings.ToLower(v) == "true"
}
mrodm (Contributor, author):

Added to the constructor so it can be queried whether or not the runner is using independent Elastic Agents.

if err != nil {
return nil, err
}

scenario.agent = agentDeployed

service, svcInfo, err := r.setupService(ctx, config, serviceOptions, svcInfo, agentInfo, agentDeployed, policy, serviceStateData)
mrodm (Contributor, author):

Delay setting up the service until after the handlers related to the Agent are set:

  • shutdownAgentHandler (this is set within setupAgent method)
  • removeAgentHandler
  • resetAgentPolicyHandler
  • resetAgentLogLevelHandler

mrodm (Contributor, author):

This change cannot be made.
In the servicedeployer package, some agents are created there (custom agents and K8s agents), and therefore checkEnrolledAgents must be called after setupService.

At least while support is kept for those scenarios using the Elastic Agent from the stack.

@@ -149,7 +149,7 @@ func (tsd TerraformServiceDeployer) SetUp(ctx context.Context, svcInfo ServiceIn
}
err = p.Up(ctx, opts)
if err != nil {
- return nil, fmt.Errorf("could not boot up service using Docker Compose: %w", err)
+ return &service, fmt.Errorf("could not boot up service using Docker Compose: %w", err)
mrodm (Contributor, author):

Maybe these changes could be done in a different PR.

I was thinking that it would be interesting to also return the service here in case of error, to be able to run the shutdownServiceHandler.

The same applies if there is any error while waiting for the container to become healthy.

WDYT?

mrodm (Contributor, author):

Removed these changes for now.

Comment on lines +376 to +399
logger.Debugf("Container %s status: %s (no health status)", containerDescription.ID, containerDescription.State.Status)
continue
}

// Service is up and running and it's healthy
if containerDescription.State.Status == "running" && containerDescription.State.Health.Status == "healthy" {
logger.Debugf("Container %s status: %s (health: %s)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.Health.Status)
continue
}

// Container started and finished with exit code 0
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode == 0 {
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode)
continue
}

// Container exited with code > 0
if containerDescription.State.Status == "exited" && containerDescription.State.ExitCode > 0 {
logger.Debugf("Container %s status: %s (exit code: %d)", containerDescription.ID, containerDescription.State.Status, containerDescription.State.ExitCode)
return fmt.Errorf("container (ID: %s) exited with code %d", containerDescription.ID, containerDescription.State.ExitCode)
}

// Any different status is considered unhealthy
logger.Debugf("Container %s status: unhealthy", containerDescription.ID)
mrodm (Contributor, author):

Examples of logs with these changes:

2024/06/13 19:32:35 DEBUG Wait for healthy containers: 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:35 DEBUG output command: /usr/bin/docker inspect 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:35 DEBUG Container 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c status: unhealthy
2024/06/13 19:32:36 DEBUG Wait for healthy containers: 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:36 DEBUG output command: /usr/bin/docker inspect 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c
2024/06/13 19:32:36 DEBUG Container 8f38b80e7a6b5638fd3e53bc250f49f5c2e27fa10810de506e4495d7169d4a5c status: running (health: healthy)

Previously:

2024/06/13 15:43:06 DEBUG Wait for healthy containers: 2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4
2024/06/13 15:43:06 DEBUG output command: /usr/bin/docker inspect 2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4
2024/06/13 15:43:06 DEBUG Container status: {"Config":{"Image":"elastic-package-service-87356-nginx","Labels":{"com.docker.compose.config-hash":"07b2d6403034e11414161c6489c787924e32ab36e7d12c69b977e184cf40e3c3","com.docker.compose.container-number":"1","com.docker.compose.depends_on":"","com.docker.compose.image":"sha256:106ba62762b92ccdde0edf49e09063ee28a3be98e9342dfcd3980314b0e4c192","com.docker.compose.oneoff":"False","com.docker.compose.project":"elastic-package-service-87356","com.docker.compose.project.config_files":"/opt/buildkite-agent/builds/bk-agent-prod-gcp-1718292959927056914/elastic/elastic-package/test/packages/parallel/nginx/_dev/deploy/docker/docker-compose.yml","com.docker.compose.project.working_dir":"/opt/buildkite-agent/builds/bk-agent-prod-gcp-1718292959927056914/elastic/elastic-package/test/packages/parallel/nginx/_dev/deploy/docker","com.docker.compose.service":"nginx","com.docker.compose.version":"2.24.1","maintainer":"NGINX Docker Maintainers \[email protected]\u003e"}},"ID":"2306f9dd160408ce2e26b72565d1d3037153de14167052733f2cfd553f212de4","State":{"Status":"running","ExitCode":0,"Health":{"Status":"healthy","Log":[{"Start":"2024-06-13T15:43:05.983549466Z","ExitCode":0,"Output":"  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100    97  100    97    0     0  97000      0 --:--:-- --:--:-- --:--:-- 97000\nActive connections: 1 \nserver accepts handled requests\n 1 1 1 \nReading: 0 Writing: 1 Waiting: 0 \n"}]}}}

jsoriano (Member):

Maybe these logs are too verbose? But we can leave them and update later when/if we add the trace debug level.

mrodm (Contributor, author):

Do you mean to keep the previous log messages?

jsoriano (Member):

I mean to leave it as is now in the PR, and reduce later when we have more log levels.

@@ -118,7 +118,8 @@ func (c *Client) waitUntilPolicyAssigned(ctx context.Context, a Agent, p Policy)
if err != nil {
return fmt.Errorf("can't get the agent: %w", err)
}
logger.Debugf("Agent data: %s", agent.String())
logger.Debugf("Agent %s (Host: %s): Policy ID %s LogLevel: %s Status: %s",
mrodm (Contributor, author) commented Jun 13, 2024:

These logs could probably be improved further (maybe removing some fields?).

Example of new logs:

2024/06/13 19:33:21 DEBUG Wait until the policy (ID: 1355408f-6518-4bbd-8b39-cded15bba61d, revision: 2) is assigned to the agent (ID: ca20edcb-a929-434d-b226-9138be510e80)...
2024/06/13 19:33:23 DEBUG GET https://127.0.0.1:5601/api/fleet/agents/ca20edcb-a929-434d-b226-9138be510e80
2024/06/13 19:33:23 DEBUG Agent ca20edcb-a929-434d-b226-9138be510e80 (Host: elastic-agent-12113): Policy ID 1355408f-6518-4bbd-8b39-cded15bba61d LogLevel: debug Status: updating

Previously:

2024/06/13 15:43:12 DEBUG Wait until the policy (ID: c9f48c54-c30e-4bb7-a17a-a66f6c1776b0, revision: 2) is assigned to the agent (ID: 8bf156ea-30b2-4714-8e99-d09059f0fe54)...
2024/06/13 15:43:14 DEBUG GET https://127.0.0.1:5601/api/fleet/agents/8bf156ea-30b2-4714-8e99-d09059f0fe54
2024/06/13 15:43:14 DEBUG Agent data: {"id":"8bf156ea-30b2-4714-8e99-d09059f0fe54","policy_id":"c9f48c54-c30e-4bb7-a17a-a66f6c1776b0","local_metadata":{"host":{"name":"docker-fleet-agent"},"elastic":{"agent":{"log_level":"info"}}},"status":"updating"}

Comment on lines 1220 to 1223

if err != nil {
return nil, svcInfo, fmt.Errorf("could not setup service: %w", err)
}
mrodm (Contributor, author):

Should we move this error check, to allow the handler to be set?

@@ -1,6 +1,7 @@
env:
  SETUP_GVM_VERSION: 'v0.5.2' # https://github.com/andrewkroh/gvm/issues/44#issuecomment-1013231151
  ELASTIC_PACKAGE_COMPOSE_DISABLE_VERBOSE_OUTPUT: "true"
+ ELASTIC_PACKAGE_MAXIMUM_NUMBER_PARALLEL_TESTS: 3
mrodm (Contributor, author):

If a package has parallel system tests enabled, those tests will be triggered using at most 3 goroutines.

go.mod Outdated
Comment on lines 187 to 188

replace github.com/elastic/package-spec/v3 => github.com/mrodm/package-spec/v3 v3.0.0-20240613145150-8d065e83d217
mrodm (Contributor, author):

Requires a new version of package-spec

mrodm marked this pull request as ready for review June 17, 2024 17:34
mrodm requested a review from a team as a code owner June 17, 2024 17:34
jsoriano (Member) left a comment:

Looks great, also tested locally and works as expected.
Commenting on some of the open questions.

)

type testConfig struct {
testrunner.SkippableConfig `config:",inline"`
mrodm (Contributor, author):

Added the option to skip individual policy tests.
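
For example, a hypothetical test configuration using this option — assuming the skip block follows the same reason/link convention used by other test types, which is an assumption here:

skip:
  reason: "Skip while investigating failures (illustrative reason)"
  link: "https://github.com/elastic/elastic-package/issues/0000"  # hypothetical issue link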

mrodm requested a review from jsoriano June 18, 2024 15:52
jsoriano (Member) left a comment:

Great 👍


elasticmachine (Collaborator) commented Jun 19, 2024

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @mrodm

mrodm merged commit 700dd69 into elastic:main Jun 19, 2024
3 checks passed
mrodm deleted the system-tests-parallel branch June 19, 2024 19:40