SOLR throughput max out #3875

Closed
jbrown-xentity opened this issue Jun 30, 2022 · 7 comments

@jbrown-xentity
Contributor

SOLR can hit the maximum throughput of its EFS volume, which makes all read and write operations slow or unresponsive and causes site outages. We believe the harvest gather process triggers this.

https://docs.aws.amazon.com/efs/latest/ug/performance.html

How to reproduce

  1. Harvest a large DCAT-US file, like NASA or Commerce

Expected behavior

SOLR does not crash, and EFS remains available

Actual behavior

SOLR crashed, EFS unresponsive

Sketch

We need to examine throughput and possibly set our own values instead of relying on the default bursting mode: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/efs_file_system

Note: as a more drastic "blow it up" alternative, consider EBS volumes: https://www.geeksforgeeks.org/difference-between-amazon-ebs-and-amazon-efs/

@FuhuXia
Member

FuhuXia commented Jun 30, 2022

image

As the graph shows, once throughput utilization hits 100%, scaling catalog-gather to 0 brings the utilization back down.

@nickumia-reisys
Contributor

nickumia-reisys commented Jul 1, 2022

Just noting here that we would need a fairly high provisioned throughput, since it has peaked at close to 500 MB/s at times in the past:

image

@nickumia-reisys
Contributor

This graph confirms that it was using ~3.8x the Provisioned IO capacity. The ability to burst is based on the burst credit methodology outlined here: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes

image
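
To make the burst credit reasoning concrete, here is a rough Python sketch of the arithmetic. It assumes the baseline figure from the AWS performance page linked above as it stood at the time (about 50 MiB/s of baseline throughput per TiB stored); the function names and the sample numbers are hypothetical and are not taken from our file systems.

```python
from typing import Optional


def baseline_mib_per_s(stored_gib: float, rate_per_tib: float = 50.0) -> float:
    """Baseline throughput earned by storage size.

    rate_per_tib is an assumption (50 MiB/s per TiB stored); check the
    current AWS documentation before relying on it.
    """
    return stored_gib / 1024.0 * rate_per_tib


def seconds_until_credits_depleted(
    credits_mib: float, actual_mib_per_s: float, baseline: float
) -> Optional[float]:
    """Time until the credit balance reaches zero at a steady load.

    Credits accrue at the baseline rate and are spent at the actual rate,
    so the balance only shrinks when actual throughput exceeds baseline.
    Returns None when the load does not drain credits.
    """
    net_drain = actual_mib_per_s - baseline
    if net_drain <= 0:
        return None
    return credits_mib / net_drain
```

Under this model, a load running at ~3.8x the baseline drains credits at about 2.8x the baseline rate, which is why a long harvest can empty the balance and throttle every other client.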

@nickumia-reisys
Contributor

Our burst credits over the last three days:

image

@FuhuXia
Member

FuhuXia commented Jul 1, 2022

I have manually changed both catalog-prod and catalog-staging EFS from Burst to Provisioned throughput mode, giving them 150 MiB/s and 100 MiB/s respectively. Let us see how well they hold up under normal web traffic and harvesting activity.
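
For reference, the same change could be scripted rather than made by hand. The following is a minimal sketch, assuming boto3 and its EFS `update_file_system` call; the helper names and the file system ID in the usage note are hypothetical, and this is not how the change above was actually applied.

```python
def provisioned_throughput_params(file_system_id: str, mibps: float) -> dict:
    """Build the request parameters for switching to Provisioned mode."""
    if mibps <= 0:
        raise ValueError("provisioned throughput must be positive")
    return {
        "FileSystemId": file_system_id,
        "ThroughputMode": "provisioned",
        "ProvisionedThroughputInMibps": float(mibps),
    }


def set_provisioned_throughput(efs_client, file_system_id: str, mibps: float):
    """Apply the change using an EFS client, e.g. boto3.client("efs")."""
    params = provisioned_throughput_params(file_system_id, mibps)
    return efs_client.update_file_system(**params)
```

Usage would look like `set_provisioned_throughput(boto3.client("efs"), "fs-EXAMPLE", 150)`, where `fs-EXAMPLE` is a placeholder ID. Passing the client in keeps the helper easy to exercise with a stub before pointing it at a real account.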

@nickumia-reisys
Contributor

I'll also mention that the reason we didn't implement EBS volumes is that there isn't a native AWS solution that supports them. Using EBS requires third-party Docker volume driver plugins, which would be another technology to maintain, and they didn't integrate easily with our use case.

Reference: https://aws.amazon.com/blogs/compute/amazon-ecs-and-docker-volume-drivers-amazon-ebs/

@FuhuXia
Member

FuhuXia commented Jul 6, 2022

After we set 100 MiB/s Provisioned throughput on the prod and staging EFS volumes, there were no more outages caused by throughput, but Solr started returning 504 Gateway Time-out errors. Looking at the EFS metrics, Percent IO limit appears to have become the issue: it stays above 1 (100%) for extended periods while load testing that mimics normal web traffic is running. We are changing the EFS performance mode from General Purpose to Max I/O, as shown in the last three PRs.
image
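
One way to quantify "stays above the limit for extended time" is to summarize the metric samples exported from the graph. This is only a sketch: the function names are hypothetical, and it assumes the samples are plain numbers on a percent scale (pass `limit=1.0` if they are ratios, as the graph above appears to show).

```python
from typing import Sequence


def fraction_at_or_over_limit(samples: Sequence[float], limit: float = 100.0) -> float:
    """Share of samples at or above the I/O limit (0.0 for no samples)."""
    if not samples:
        return 0.0
    over = sum(1 for value in samples if value >= limit)
    return over / len(samples)


def longest_run_at_or_over_limit(samples: Sequence[float], limit: float = 100.0) -> int:
    """Length of the longest consecutive stretch at or above the limit."""
    longest = 0
    current = 0
    for value in samples:
        if value >= limit:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

Multiplying the longest run by the sampling period gives the duration of the worst sustained saturation, which is a useful number to compare before and after the performance mode change.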
