chore: increase timeout for Ceph #618

mdelapenya · 2021-01-19T16:32:16Z

What does this PR do?

We are increasing the timeout when waiting for the Ceph integration to be up-and-running, as the Beats team already do in https://github.com/elastic/beats/blob/ef6274d0d1e36308a333cbed69846a1bd63528ae/metricbeat/module/ceph/mgr_osd_tree/mgr_osd_tree_integration_test.go#L35. Because testcontainers is not supporting (yet) a waitFor strategy (it's about to be merged), we have to sleep the thread. And we are increasing the timeout to 2 minutes only for Ceph, the rest of the services stay at 30 seconds.

In between, we refactored the Sleep method to simplify its usage, using time.Duration instead of string as input parameter.

Why is it important?

As a result of this PR, we expect to reduce the flakiness detected by the E2E for Cep, causing noise in Beats PRs, which needed to be re-triggered.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have run the Unit tests for the CLI, and they are passing locally
I have run the End-2-End tests for the suite I'm working on, and they are passing locally
I have noticed new Go dependencies (run make notice in the proper directory)

How to test this PR locally

SUITE="metricbeat" TAGS="integrations && ceph"  TIMEOUT_FACTOR=3 DEVELOPER_MODE=true LOG_LEVEL=TRACE make -C e2e functional-test

Related issues

Closes Investigate why Ceph tests are flaky, as it seems to fail with quite a frequency #607

Follow-ups

Once testcontainers adds the functionality to wait for compose services, we will be able to consume it in a nicer way, instead of blocking the thread by 2 minutes. See testcontainers/testcontainers-go#261

An example of how it would be:

compose.WithExposedService(
    "nginx_1", 80,
    wait.NewHTTPStrategy("/").
        WithPort("80/tcp").
        WithStartupTimeout(10*time.Second))

elasticmachine · 2021-01-19T16:56:07Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: Pull request #618 opened
- Start Time: 2021-01-19T16:32:31.434+0000
Duration: 23 min 29 sec
Commit: c6b7652

Test stats 🧪

Test	Results
Failed	0
Passed	105
Skipped	9
Total	114

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	105
Skipped	9
Total	114

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya · 2021-01-20T16:17:30Z

@jsoriano I'm still seeing some failures in Ceph tests:

[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=PuIzIHcBlvID7Xz7NoyS error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=P-IzIHcBlvID7Xz7NoyS error.message="HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=QOIzIHcBlvID7Xz7NoyS error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=QeIzIHcBlvID7Xz7Oox4 error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=QuIzIHcBlvID7Xz7Oox4 error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=Q-IzIHcBlvID7Xz7Oox4 error.message="HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=R-IzIHcBlvID7Xz7Q4zq error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=SOIzIHcBlvID7Xz7Q4zq error.message="HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=SeIzIHcBlvID7Xz7Q4zq error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=SuIzIHcBlvID7Xz7Q4zq error.message="HTTP error 404 in : 404 NOT FOUND"

Are you familiar with the error in fetch: HTTP error 404 in : 404 NOT FOUND error? You can find the container logs for the test execution here: https://beats-ci.elastic.co/job/e2e-tests/job/e2e-testing-mbp/job/master/180/artifact/docker_logs_metricbeat_integrations%20&&%20ceph.log

jsoriano · 2021-01-20T18:46:52Z

@mdelapenya investigating this I have seen that the metricsets in the Ceph module query two different endpoints, some metricsets work with the "Ceph REST API", and some others with the "Ceph Manager Daemon". I guess that querying the incorrect endpoint will result in 404s.

In the reference configuration they are separated and query different endpoints: https://www.elastic.co/guide/en/beats/metricbeat/7.10/metricbeat-module-ceph.html#_example_configuration_8

The docker compose scenario starts two containers, one called ceph, and another one called ceph-api. Integration tests in Beats explicitly use one or the other service depending on the one they support:

osd_df collects metrics from the REST API: https://github.com/elastic/beats/blob/c6af0496569b2ea95e7325e2711496b29e831405/metricbeat/module/ceph/osd_df/osd_df_integration_test.go#L30
mgr_cluster_disk collects metrics from the manager daemon: https://github.com/elastic/beats/blob/c6af0496569b2ea95e7325e2711496b29e831405/metricbeat/module/ceph/mgr_cluster_disk/mgr_cluster_disk_integration_test.go#L35

System tests only support the mgr_* metricsets: https://github.com/elastic/beats/blob/c6af0496569b2ea95e7325e2711496b29e831405/metricbeat/module/ceph/test_ceph.py#L33-L40

As this seems quite special, and I guess this module is not so heavily used, I'd suggest to skip tests for this module by now and open a follow up issue. Unless you can think in a straightforward way of supporting this here.

mdelapenya · 2021-01-20T20:41:42Z

What I'm concerned about is the tests do pass many many times. But how? It seems it's not a timeout issue (we also increased the timeout_factor in CI from 3 to 5 -fleet reasons), so I'd say there is something weird:

the containers are not properly started. I'm taking a look at the original docker-compose Vs the massaged one:

version: '2.3'

services:
  ceph:
    image: docker.elastic.co/integrations-ci/beats-ceph:${CEPH_VERSION:-master-97985eb-nautilus-centos-7-x86_64}-2
    build:
      context: ./_meta
      dockerfile: Dockerfile.${CEPH_CODENAME:-nautilus}
      args:
        CEPH_VERSION: ${CEPH_VERSION:-master-97985eb-nautilus-centos-7-x86_64}
    ports:
      - 5000
      - 8003
      - 8080
  ceph-api:
    image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
    build:
      context: ./_meta
      dockerfile: Dockerfile.jewel
      args:
        CEPH_VERSION: master-6373c6a-jewel-centos-7-x86_64
    ports:
      - 5000

massaged one:

version: "2.3"
services:
  ceph:
    image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
    ports:
    - 5000
  ceph-api:
    image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
    ports:
    - 5000

For some reason all ports are not extracted back to the new one. I'll debug that process, but we do not do any magic: simply using these structs to load the YAML file:

type service interface{}
type compose struct {
	Version  string             `yaml:"version"`
	Services map[string]service `yaml:"services"`
}

... 
        c := compose{}
	err = yaml.Unmarshal(bytes, &c)
...

mdelapenya · 2021-01-20T20:44:15Z

Wait, is this correct? https://github.com/elastic/beats/blob/master/metricbeat/module/ceph/docker-compose.yml#L5-L16

Is the ceph-api using a hardcoded version for master-6373c6a-jewel?

mdelapenya · 2021-01-20T21:40:29Z

https://github.com/elastic/beats/blob/master/metricbeat/module/ceph/docker-compose.yml#L5-L16

@jsoriano I think ceph-api compose file is wrong, probably caused by elastic/beats@273c6a8:
See https://github.com/elastic/beats/blame/master/metricbeat/module/ceph/docker-compose.yml#L5-L16

I'm going to send a fix there 🙏 review 😄

jsoriano · 2021-01-21T12:01:27Z

@jsoriano I think ceph-api compose file is wrong, probably caused by elastic/beats@273c6a8:
See https://github.com/elastic/beats/blame/master/metricbeat/module/ceph/docker-compose.yml#L5-L16

This conversation seems relevant: elastic/beats#16254 (comment) It seems that we decided to delay some things.

mdelapenya added 2 commits January 19, 2021 17:21

chore: simplify sleep parameter, using a Duration

8678dda

chore: wait longer for Ceph service

c6b7652

mdelapenya self-assigned this Jan 19, 2021

mdelapenya requested a review from a team January 19, 2021 16:32

mdelapenya changed the title ~~~~ chore: increase timeout for Ceph Jan 19, 2021

mdelapenya requested a review from jsoriano January 19, 2021 16:35

mdelapenya added Flaky Test metricbeat labels Jan 19, 2021

mdelapenya marked this pull request as ready for review January 19, 2021 17:00

cachedout approved these changes Jan 19, 2021

View reviewed changes

v1v approved these changes Jan 19, 2021

View reviewed changes

jsoriano approved these changes Jan 19, 2021

View reviewed changes

mdelapenya merged commit 5ea7866 into elastic:master Jan 19, 2021

mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (elastic#618)

9f8e870

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya mentioned this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) backport for 7.x #619

Merged

mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (elastic#618)

18fafdb

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya mentioned this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) backport for 7.11.x #620

Merged

mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (elastic#618)

ae2b7e4

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya mentioned this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) backport for 7.10.x #621

Merged

mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (elastic#618)

125c0ab

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya mentioned this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) backport for 6.8.x #622

Merged

mdelapenya deleted the 607-integrations-timeout branch January 19, 2021 17:17

mdelapenya added a commit that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) (#619)

e28c29e

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya added a commit that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) (#620)

1552364

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya added a commit that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) (#621)

3b25d3a

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya added a commit that referenced this pull request Jan 19, 2021

chore: increase timeout for Ceph (#618) (#622)

2256c52

* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service

mdelapenya mentioned this pull request Jan 20, 2021

fix: do not override compose configurations when syncing from beats #630

Merged

9 tasks

mdelapenya mentioned this pull request Jan 20, 2021

fix: use supported-versions variables in ceph-api elastic/beats#23602

Closed

6 tasks

mdelapenya mentioned this pull request Oct 26, 2021

chore: bump elastic-package to v0.25.0 #1721

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: increase timeout for Ceph #618

chore: increase timeout for Ceph #618

mdelapenya commented Jan 19, 2021 •

edited

Loading

elasticmachine commented Jan 19, 2021

Build stats

Test stats 🧪

Test stats 🧪

mdelapenya commented Jan 20, 2021

jsoriano commented Jan 20, 2021

mdelapenya commented Jan 20, 2021

mdelapenya commented Jan 20, 2021

mdelapenya commented Jan 20, 2021

jsoriano commented Jan 21, 2021

chore: increase timeout for Ceph #618

chore: increase timeout for Ceph #618

Conversation

mdelapenya commented Jan 19, 2021 • edited Loading

What does this PR do?

Why is it important?

Checklist

How to test this PR locally

Related issues

Follow-ups

elasticmachine commented Jan 19, 2021

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

Test stats 🧪

mdelapenya commented Jan 20, 2021

jsoriano commented Jan 20, 2021

mdelapenya commented Jan 20, 2021

mdelapenya commented Jan 20, 2021

mdelapenya commented Jan 20, 2021

jsoriano commented Jan 21, 2021

mdelapenya commented Jan 19, 2021 •

edited

Loading