HDDS-5494. Reduce retry in Kubernetes test #2461

Merged: 2 commits, Jul 28, 2021


adoroszlai

What changes were proposed in this pull request?

Kubernetes tests wait for cluster startup, checking several conditions with retry. In the worst case, each condition is checked 100 times with a 3-second delay (about 5 minutes per condition), so the test may take 15 minutes to fail.
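The worst-case figure can be checked with a quick calculation. This is only a sketch: the variable names are made up for illustration, and the count of three conditions is an assumption inferred from the 15-minute total (pods running, SCM ready, OM ready), not something taken from the test scripts.

```shell
#!/bin/sh
# Illustrative values from the description above; names are hypothetical.
max_retries=100    # attempts per condition
delay_seconds=3    # sleep between attempts
conditions=3       # ASSUMPTION: inferred from the 15-minute worst case

per_condition=$(( max_retries * delay_seconds ))   # seconds spent on one condition
worst_case=$(( per_condition * conditions ))       # seconds if every condition times out

echo "per condition: $(( per_condition / 60 )) min"
echo "worst case:    $(( worst_case / 60 )) min"
```

Stopping after the first exhausted condition caps the wait at roughly the per-condition figure instead of the full total.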

Skip waiting for SCM and OM readiness if retries for previous conditions are exhausted.
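A minimal sketch of the idea, assuming a retry helper that reports failure once its attempts run out. The function names `retry` and `wait_for_startup`, and the variables `MAX_RETRY` and `RETRY_DELAY`, are hypothetical and are not taken from the project's test library; each condition is assumed to be the name of a command or shell function that takes no arguments.

```shell
#!/bin/sh
# Sketch only: names below are hypothetical, not the project's actual helpers.

MAX_RETRY="${MAX_RETRY:-100}"     # attempts per condition
RETRY_DELAY="${RETRY_DELAY:-3}"   # seconds between attempts

# Run the given command until it succeeds or MAX_RETRY attempts are used up.
# Returns 0 on success, 1 when retries are exhausted.
retry() {
  _attempt=1
  while [ "$_attempt" -le "$MAX_RETRY" ]; do
    if "$@"; then
      return 0
    fi
    echo "$_attempt '$*' is failed..."
    sleep "$RETRY_DELAY"
    _attempt=$(( _attempt + 1 ))
  done
  return 1
}

# Check conditions in order and stop at the first one that never becomes true,
# so later conditions do not each consume a full retry budget.
wait_for_startup() {
  for _condition in "$@"; do
    retry "$_condition" || return 1
  done
  echo "**** Cluster is up and running ****"
}
```

With this shape, a call such as `wait_for_startup all_pods_are_running scm_ready om_ready` (where `scm_ready` and `om_ready` are placeholder names) would give up after the first condition times out rather than waiting on the remaining ones.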

https://issues.apache.org/jira/browse/HDDS-5494

How was this patch tested?

Currently the test in the ozone environment fails to start the cluster, so the change is verified by the failing CI check:

...
99 'all_pods_are_running' is failed...
4 pods are running out from the 5
100 'all_pods_are_running' is failed...

**** Executing robot tests scm-0 ****

...

https://github.com/adoroszlai/hadoop-ozone/runs/3159798487#step:6:797

The happy path is verified by successful startup in the getting-started environment:

...
-1 pods are running. Waiting for more.
12 'all_pods_are_running' is failed...
5 pods are running out from the 6
13 'all_pods_are_running' is failed...
1 'grep_log scm-0 SCM exiting safe mode.' is failed...
2 'grep_log scm-0 SCM exiting safe mode.' is failed...
3 'grep_log scm-0 SCM exiting safe mode.' is failed...
4 'grep_log scm-0 SCM exiting safe mode.' is failed...
5 'grep_log scm-0 SCM exiting safe mode.' is failed...
6 'grep_log scm-0 SCM exiting safe mode.' is failed...
7 'grep_log scm-0 SCM exiting safe mode.' is failed...
8 'grep_log scm-0 SCM exiting safe mode.' is failed...
2021-07-26 09:01:01 INFO  SCMSafeModeManager:248 - SCM exiting safe mode.
2021-07-26 09:01:02 INFO  BaseHttpServer:329 - HTTP server of ozoneManager listening at http://0.0.0.0:9874

**** Cluster is up and running ****

...

https://github.com/adoroszlai/hadoop-ozone/runs/3159798487#step:6:172

@adoroszlai adoroszlai self-assigned this Jul 26, 2021
@GeorgeJahad

This looks like a useful fix to me. (I ran into the same problem working on this kubernetes PR: #2464)


@smengcl smengcl left a comment


Thanks @adoroszlai for the improvement.

@adoroszlai adoroszlai merged commit 5336bb9 into apache:master Jul 28, 2021
@adoroszlai adoroszlai deleted the HDDS-5494 branch July 28, 2021 05:39
@adoroszlai

Thanks @GeorgeJahad and @smengcl for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Jul 30, 2021
* master: (48 commits)
  HDDS-5514. Skip check for UNHEALTHY containers for datanode finalize. (apache#2469)
  HDDS-5279. OFS mkdir -p does not work when Volume is not pre-created (apache#2412)
  HDDS-5328. Remove delete container command from admin CLI (apache#2456)
  HDDS-5382. Increase default container report interval to 60 mins (apache#2363)
  HDDS-5378 Add APIs to retrieve Namespace Summary from Recon (apache#2417)
  HDDS-5466. Refactor BlockOutputStream. (apache#2442)
  HDDS-5465. Delete redundant code when set、add and remove bucket acl (apache#2439)
  HDDS-5184. Use separate DB profile for Datanodes. (apache#2214)
  HDDS-5494. Reduce retry in Kubernetes test (apache#2461)
  HDDS-5414. Data buffers incorrectly filtered for Ozone Insight (apache#2387)
  HDDS-5450. Avoid refresh pipeline for S3 headObject (apache#2431)
  HDDS-5500. New k3s version breaks kubernetes test (apache#2464)
  HDDS-5489. Install OS-specific flekszible (apache#2462)
  Multi-raft style placement with permutations for offline data generator (apache#2434)
  HDDS-5484. Intermittent failure in TestReplicationManager#testMovePrerequisites (apache#2454)
  HDDS-5443 Create and then recreate a bucket with a randomized name (apache#2436)
  HDDS-5492. Disable failing kubernetes test (apache#2459)
  HDDS-4330. Bootstrap new OM node (apache#1494)
  HDDS-5418. Let Recon send reregisterCommand to Datanodes if DatanodeDetails changed (apache#2392)
  HDDS-5479. s3g bucket list failed when there is non-english key name. (apache#2450)
  ...