SOLR throughput max out #3875

jbrown-xentity · 2022-06-30T21:46:16Z

SOLR can hit the EFS maximum throughput, which makes all operations (reads and writes) slow or unresponsive and causes site outages. We believe the harvest gather process triggers this.

https://docs.aws.amazon.com/efs/latest/ug/performance.html

How to reproduce

Harvest a large DCAT-US file, such as NASA or Commerce.

Expected behavior

SOLR does not crash, and EFS remains available

Actual behavior

SOLR crashed, EFS unresponsive

Sketch

We need to examine throughput, and possibly set our own values instead of the bursting defaults: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/efs_file_system

Note: as a more drastic alternative, consider EBS volumes: https://www.geeksforgeeks.org/difference-between-amazon-ebs-and-amazon-efs/

FuhuXia · 2022-06-30T21:49:55Z

As shown in the graph, once throughput utilization hits 100%, scaling catalog-gather to 0 brings it back down.

nickumia-reisys · 2022-07-01T13:56:57Z

Just mentioning here: we would need a fairly high provisioned throughput, since it has peaked at close to 500 MB/s at times in the past.

nickumia-reisys · 2022-07-01T14:09:10Z

This graph validates that it was using ~3.8x the Provisioned IO capacity. The ability to burst is based on the burst credit methodology outlined here: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
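To make the burst credit idea concrete, here is a rough sketch of how long a credit balance lasts when usage runs above the baseline. It is a simplification, not AWS's exact accounting: credits are assumed to accrue at the baseline rate and drain at the actual rate. The baseline of 10 MiB/s and the starting balance are hypothetical numbers chosen for illustration; only the ~3.8x ratio comes from the graph mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BurstModel:
    """Simplified burst-credit model (an assumption, not AWS's exact rules)."""
    baseline_mibps: float  # rate at which credits accrue, in MiB/s
    credits_mib: float     # current credit balance, in MiB


def seconds_until_exhausted(model: BurstModel, actual_mibps: float) -> Optional[float]:
    """Return seconds until the credit balance reaches zero.

    Returns None when usage is at or below baseline, because the
    balance never drains in that case.
    """
    drain_mibps = actual_mibps - model.baseline_mibps
    if drain_mibps <= 0:
        return None
    return model.credits_mib / drain_mibps


# Hypothetical figures: 10 MiB/s baseline, usage at 3.8x that rate.
model = BurstModel(baseline_mibps=10.0, credits_mib=100800.0)
remaining = seconds_until_exhausted(model, actual_mibps=38.0)
print(remaining)
```

With these made-up numbers the balance drains at 28 MiB/s, so sustained harvesting at that rate would eventually exhaust any finite credit balance.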

nickumia-reisys · 2022-07-01T14:09:59Z

Our burst credits over the last three days:

FuhuXia · 2022-07-01T14:17:48Z

I have manually changed both catalog-prod and catalog-staging EFS from Burst to Provisioned throughput mode, giving them 150 MiB/s and 100 MiB/s respectively. Let us see how well they hold up under normal web traffic and harvesting activity.
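The change above was made manually. For reference, a scripted equivalent could look roughly like the sketch below, assuming boto3 and its EFS `update_file_system` call. The file system IDs are hypothetical placeholders, and the helper names are made up for this example.

```python
def provisioned_throughput_request(file_system_id: str, mibps: float) -> dict:
    """Build the arguments for switching an EFS file system to provisioned mode."""
    if mibps <= 0:
        raise ValueError("provisioned throughput must be positive")
    return {
        "FileSystemId": file_system_id,
        "ThroughputMode": "provisioned",
        "ProvisionedThroughputInMibps": float(mibps),
    }


def apply_request(request: dict) -> dict:
    """Send the request to AWS. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("efs").update_file_system(**request)


# Hypothetical IDs; the throughput values are the ones quoted in the comment above.
prod = provisioned_throughput_request("fs-0123456789abcdef0", 150)
staging = provisioned_throughput_request("fs-0fedcba9876543210", 100)
print(prod)
print(staging)
```

Calling `apply_request(prod)` would perform the actual update. Longer term, the same setting would more naturally live in the Terraform `aws_efs_file_system` resource linked in the issue description.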

nickumia-reisys · 2022-07-01T14:20:00Z

I will also mention that the reason we didn't implement EBS volumes is that there isn't a native AWS solution that supports them. It requires third-party Docker volume driver plugins, which means maintaining yet another technology, and it didn't integrate easily with our use case.

Reference: https://aws.amazon.com/blogs/compute/amazon-ecs-and-docker-volume-drivers-amazon-ebs/

FuhuXia · 2022-07-06T20:53:31Z

After we set 100 MiB/s Provisioned throughput for the prod and staging EFS volumes, there were no more outages caused by throughput. But SOLR started returning 504 Gateway Time-out errors. Looking at the EFS metrics, Percent IO limit appears to be the issue: it stays above 1 (100%) for extended periods while load testing that mimics normal web traffic is running. We are changing the EFS performance mode from General Purpose to Max I/O, as shown in the last three PRs.
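A small sketch of how "stays at the limit for an extended time" could be checked against an exported metric series, such as the `PercentIOLimit` metric in the `AWS/EFS` CloudWatch namespace. The sample values are invented, and whether the series is expressed as a fraction or a percentage depends on how it was exported, so the limit is a parameter.

```python
from typing import Iterable


def longest_run_at_limit(samples: Iterable[float], limit: float = 100.0) -> int:
    """Return the length of the longest run of consecutive samples at or above limit."""
    longest = 0
    current = 0
    for value in samples:
        if value >= limit:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Invented one-minute samples of the IO limit percentage during a load test.
samples = [40.0, 100.0, 100.0, 95.0, 100.0, 100.0, 100.0, 20.0]
print(longest_run_at_limit(samples))
```

A long run relative to the load test duration would support the reading that the General Purpose IO limit, rather than throughput, is what causes the 504s.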

nickumia-reisys assigned FuhuXia Jul 5, 2022

This was referenced Jul 6, 2022

EFS - Max IO vs. General Purpose GSA-TTS/datagov-brokerpak-solr#44

Merged

Solr on ECS - EFS - Max IO vs. General Purpose GSA/datagov-ssb#147

Merged

Set Solr EFS to be Max IO Mode + Turn off harvesting (prod) GSA/catalog.data.gov#488

Merged

hkdctol closed this as completed Jul 7, 2022

nickumia-reisys mentioned this issue Jul 20, 2022

Boost EFS performance for Solr-on-ECS #3855

Closed

7 tasks

nickumia-reisys mentioned this issue Sep 15, 2022

Dissect Solr Performance through New Relic #3956

Open

5 tasks

nickumia-reisys mentioned this issue Nov 11, 2022

[Tiny EPIC] Support SOLR standalone on ECS #3826

Closed

39 tasks
