HDDS-5382. Increase default container report interval to 60 mins #2363

sadanand48 · 2021-06-24T11:17:55Z

What changes were proposed in this pull request?

During scale testing of Ozone with 350k+ containers and nearly 1 million replica reports, we observed a sudden burst in SCM heap usage. In HDFS, the full block report interval is 6 hours by default, with incremental block reports in between. Similarly, there are incremental reports in SCM. Setting the full container report interval to 1 hour made things quite stable in our tests, and 60s for a full report seems very aggressive.

Increase default container report interval to 60 mins.
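
As a rough back-of-the-envelope sketch of why 60s looks aggressive: assume the roughly 1 million replica entries are all re-sent once per full-report interval and spread evenly across it. That even spread is an assumption made only for illustration (the scale test saw bursts, not a smooth rate), and the function name below is hypothetical, not part of Ozone.

```python
def replica_entries_per_second(total_replicas: int, interval_seconds: int) -> float:
    """Average replica entries SCM would process per second, assuming every
    replica is reported exactly once per full-report interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return total_replicas / interval_seconds


# "Nearly 1 million replica reports" from the scale test described above.
TOTAL_REPLICAS = 1_000_000

# Intervals mentioned in this PR; the labels are informal.
INTERVALS = {
    "60s (previous default)": 60,
    "60m (proposed default)": 60 * 60,
    "6h (HDFS full block report)": 6 * 60 * 60,
}

for label, seconds in INTERVALS.items():
    rate = replica_entries_per_second(TOTAL_REPLICAS, seconds)
    print(f"{label}: ~{rate:,.0f} replica entries/s on average")
```

Under this simplified model, moving from 60s to 60m cuts the average full-report load by a factor of 60, while incremental reports continue to carry changes in between.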

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5382

How was this patch tested?

only a config change

guihecheng · 2021-06-25T02:55:22Z

Hi @sadanand48, sorry to disturb, but I happened to see this and can't help wondering why the default report intervals are changed. I can't find any description in the related JIRA either; could you please explain a bit? Thanks~

bshashikant · 2021-06-25T04:08:57Z

@guihecheng, during scale testing of Ozone with 350k+ containers and nearly 1 million replica reports, it was observed that there is a sudden burst in SCM heap usage. In HDFS, the full block report interval is 6 hours by default, with incremental block reports in between. Similarly, there are incremental reports in SCM. Setting the full container report interval to 1 hour made things quite stable in tests, and 60s for a full report seems very aggressive.

bshashikant · 2021-06-25T04:10:42Z

@sadanand48, please change the default container and pipeline report interval to 60s in the acceptance tests and in MiniOzoneCluster so that there are no random/intermittent failures because of this.

guihecheng · 2021-06-25T06:04:17Z

Thanks @bshashikant for the explanation. This is a very useful PR for production deployments.

ChenSammi · 2021-06-28T08:21:58Z

@sadanand48, can you elaborate a bit on why the pipeline report interval is changed to 60m? If I remember correctly, Pipeline Manager currently uses this pipeline report interval as the timeout when waiting for a pipeline to be ready to serve writes.

JacksonYao287 · 2021-07-01T04:54:45Z

Maybe for now we could change only the full container report interval, which is the heavy burden on SCM.

bshashikant · 2021-07-05T10:00:51Z

> @sadanand48, can you elaborate a bit on why the pipeline report interval is changed to 60m? If I remember correctly, Pipeline Manager currently uses this pipeline report interval as the timeout when waiting for a pipeline to be ready to serve writes.

I think the pipeline report is sent on a heartbeat as soon as a pipeline is created, or closed and destroyed. There is no dependency on the pipeline report interval as such for opening up the pipeline for writes. Full reports can be sent at a longer interval.
cc ~ @nandakumar131
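
To make the distinction in this comment concrete, here is a minimal sketch of the behaviour as described: pipeline changes ride on the next heartbeat, while full reports wait for the configured interval. `ReportScheduler`, its fields, and the report labels are hypothetical names for illustration only; this models the comment's description and is not Ozone's actual datanode code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReportScheduler:
    """Hypothetical model of which reports accompany a heartbeat."""
    full_report_interval_s: int
    last_full_report_s: int = 0

    def reports_for_heartbeat(self, now_s: int, pipeline_changed: bool) -> List[str]:
        reports: List[str] = []
        if pipeline_changed:
            # Event-driven: sent with this heartbeat, independent of the interval.
            reports.append("pipeline-event")
        if now_s - self.last_full_report_s >= self.full_report_interval_s:
            # Periodic: only once the full-report interval has elapsed.
            reports.append("full")
            self.last_full_report_s = now_s
        return reports


scheduler = ReportScheduler(full_report_interval_s=60 * 60)
print(scheduler.reports_for_heartbeat(now_s=30, pipeline_changed=True))
print(scheduler.reports_for_heartbeat(now_s=60, pipeline_changed=False))
print(scheduler.reports_for_heartbeat(now_s=3600, pipeline_changed=False))
```

In this model a newly created pipeline is reported 30 seconds in, even though the full report is an hour away, which is the behaviour the comment argues for. Whether Pipeline Manager also uses the interval as a readiness timeout is the open question raised earlier in the thread.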

bshashikant · 2021-07-12T10:03:36Z

@nandakumar131, can you also have a look?

jojochuang · 2021-07-27T03:30:44Z

I support increasing the default interval. @ChenSammi, feel free to let us know if something is still holding this back.

mukul1987

+1, LGTM

mukul1987

Thanks for changing this to only the container report interval.

ChenSammi · 2021-07-29T05:53:31Z

The last patch LGTM, +1.

* master: (48 commits)
  HDDS-5514. Skip check for UNHEALTHY containers for datanode finalize. (apache#2469)
  HDDS-5279. OFS mkdir -p does not work when Volume is not pre-created (apache#2412)
  HDDS-5328. Remove delete container command from admin CLI (apache#2456)
  HDDS-5382. Increase default container report interval to 60 mins (apache#2363)
  HDDS-5378 Add APIs to retrieve Namespace Summary from Recon (apache#2417)
  HDDS-5466. Refactor BlockOutputStream. (apache#2442)
  HDDS-5465. Delete redundant code when set、add and remove bucket acl (apache#2439)
  HDDS-5184. Use separate DB profile for Datanodes. (apache#2214)
  HDDS-5494. Reduce retry in Kubernetes test (apache#2461)
  HDDS-5414. Data buffers incorrectly filtered for Ozone Insight (apache#2387)
  HDDS-5450. Avoid refresh pipeline for S3 headObject (apache#2431)
  HDDS-5500. New k3s version breaks kubernetes test (apache#2464)
  HDDS-5489. Install OS-specific flekszible (apache#2462)
  Multi-raft style placement with permutations for offline data generator (apache#2434)
  HDDS-5484. Intermittent failure in TestReplicationManager#testMovePrerequisites (apache#2454)
  HDDS-5443 Create and then recreate a bucket with a randomized name (apache#2436)
  HDDS-5492. Disable failing kubernetes test (apache#2459)
  HDDS-4330. Bootstrap new OM node (apache#1494)
  HDDS-5418. Let Recon send reregisterCommand to Datanodes if DatanodeDetails changed (apache#2392)
  HDDS-5479. s3g bucket list failed when there is non-english key name. (apache#2450)
  ...

Sadanand Shenoy added 2 commits June 24, 2021 16:45

Increase default container report interval to 60 mins

ca36a83

trigger new CI check

1a34530

Sadanand Shenoy added 3 commits June 25, 2021 22:45

addressed comment

57aaf32

added config to MiniOzoneCluster

14d2423

fix checkstyle

7b8fe36

bshashikant approved these changes Jul 12, 2021

mukul1987 approved these changes Jul 28, 2021

sadanand48 and others added 6 commits July 28, 2021 14:12

Merge branch 'apache:master' into HDDS-5382

d5a1f61

revert pipeline report interval config change

e459182

remove empty line

c7c7064

fix test failure

7853809

fix checkstyle

4d62f17

trigger new CI check

20694a3

mukul1987 approved these changes Jul 29, 2021

mukul1987 merged commit 074d8f4 into apache:master Jul 29, 2021
