"Too many operations in progress" in `FrontJsonFapi` fetching #26335
Context: #26335

In this change:

* Use AWS SDK v2 and non-blocking async code to avoid blocking a thread while downloading the Facia JSON data from S3
* Directly transform bytes into JsValue, without constructing a `String` first

Note that the work done between the two metrics `FrontDownloadLatency` & `FrontDecodingLatency` has changed slightly - the conversion to the basic JSON model (JsValue) now occurs in `FrontDownloadLatency`, rather than the 'decoding' step.

We could get rid of the futureSemaphore and stop using the dedicated blocking-operations pool here, but we'll leave that to another PR.

Co-authored-by: Ravi <[email protected]>
Co-authored-by: Ioanna Kokkini <[email protected]>
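To illustrate the non-blocking shape described in that commit message, here is a minimal sketch, assuming Scala 2.13+ (for `scala.jdk.FutureConverters`). It is not the code from the PR: `fetchBytes` is a hypothetical stand-in for an async S3 client call that completes with raw bytes, and `decode` is a hypothetical stand-in for a bytes-to-JSON parser.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.concurrent.CompletableFuture
import scala.concurrent.{ExecutionContext, Future}
import scala.jdk.FutureConverters._

object AsyncDownloadSketch {
  // Hypothetical stand-in for an async S3 download: completes with the object's raw bytes.
  def fetchBytes(path: String): CompletableFuture[Array[Byte]] =
    CompletableFuture.supplyAsync[Array[Byte]](() => s"""{"path":"$path"}""".getBytes(UTF_8))

  // No thread waits on the download: the Java future is adapted to a Scala Future,
  // and the bytes are handed straight to `decode` without building a String first.
  def download[A](path: String)(decode: Array[Byte] => A)(implicit ec: ExecutionContext): Future[A] =
    fetchBytes(path).asScala.map(decode)
}
```

The caller receives a `Future[A]` and composes on it, so no request thread is parked while the download is in flight.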
We've had 2
I've exported the data for these outages out of the ELK, just trying to understand what paths are being requested. The rate of 5133 errors per minute is really high, given that fronts have a CDN cache time of 1 minute, and that only 356 different paths were being requested (or at least erroring) in the outage - the number of requests is 5133/356 ≈ 14 times what an optimal request rate would be. In other words, if requests had been succeeding, and getting cached, there would have been 14x less traffic - but they were failing, being throttled by
I described a possible failure mode for Facia here: #26336 (comment) - basically, This graph shows how many
* Switch to async AWS SDK v2 for Facia JSON download

* Reduce CPU & network consumption of Facia JSON download

This change introduces ETag-based caching with the new https://github.com/guardian/etag-caching library, yielding these benefits:

* Significant savings in terms of resources:
  * **Network**: `PressedPage`s are only re-downloaded if the S3 content has _changed_
  * **CPU**: Cached `PressedPage`s are only re-parsed if the S3 content has _changed_
* Retains the 'stay current' behaviour of the old, non-caching solution: S3 is queried with every request to ensure the ETag is up-to-date.
* Also, given the cache is based on the ([S](https://github.com/blemale/scaffeine))[Caffeine](https://github.com/ben-manes/caffeine) caching library:
  * In-flight requests for a given key are unified, so 100 simultaneous requests for a key make just **1** fetch-and-parse

Currently `facia` runs on [`c6g.2xlarge`](https://instances.vantage.sh/aws/ec2/c6g.2xlarge) instances (16GB RAM), with a JVM heap of 12GB. I'm going to suggest that it's reasonable for the `ETagCache` to consume up to 4GB of the 12GB heap.

There are ~300 fronts, each of which can have [4 different variants](https://github.com/guardian/frontend/blob/f64c5b681f53a9c87ae1e76c529d08b3ad16ef6b/common/app/model/PressedPage.scala#L68-L86), so currently there could be up to 1200 different `PressedPage` objects. From [heapdump](https://drive.google.com/file/d/1yjfqLMpWqww6-3L8RBN0yrAf9hQHHpSC/view?usp=sharing) analysis, the largest `PressedPage` retains ~22MB of memory (most instances are smaller, averaging 4MB). With the 4GB budget, and assuming a worst case of 22MB per `PressedPage`, we can afford to set a max size on the cache of ~180 entries.

Although it's disappointing we can't hold _all_ of the `PressedPage`s, the [priorities of the eviction policy used by the Caffeine caching library](https://github.com/ben-manes/caffeine/wiki/Efficiency) should ensure that we get a good hit rate on the most in-demand fronts.

_Incidentally, heapdump analysis also shows that some structures within `PressedPage` are memory inefficient when considered from the perspective of holding many `PressedPage` objects in memory at once - using object pooling on `model.Tag`, for instance, would probably lead to an 80% reduction of the total retained memory._

---------

Co-authored-by: George B <[email protected]>
Co-authored-by: Ravi <[email protected]>
Co-authored-by: Ioanna Kokkini <[email protected]>
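The ETag idea in that commit message can be sketched roughly as follows. This is not the API of the `etag-caching` library: every name here (`ETagCacheSketch`, `Fetched`, `NotModified`, `Changed`, `Entry`) is hypothetical, and the sketch is synchronous and unbounded, with no in-flight request unification or eviction. It only shows the core rule: ask the store on every request, but re-parse only when the ETag has changed.

```scala
import java.util.concurrent.ConcurrentHashMap

object ETagCacheSketch {
  // Outcome of a conditional fetch: content unchanged, or new bytes with their ETag.
  sealed trait Fetched
  case object NotModified extends Fetched
  final case class Changed(etag: String, bytes: Array[Byte]) extends Fetched

  final case class Entry[V](etag: String, value: V)

  // `fetch` takes (key, ETag we already hold, if any); `parse` turns bytes into a value.
  // Both are hypothetical stand-ins for the S3 call and the JSON decoding.
  final class ETagCache[V](fetch: (String, Option[String]) => Fetched, parse: Array[Byte] => V) {
    private val entries = new ConcurrentHashMap[String, Entry[V]]()

    def get(key: String): V = {
      val cached = Option(entries.get(key))
      (fetch(key, cached.map(_.etag)), cached) match {
        case (NotModified, Some(entry)) =>
          entry.value // unchanged: no download of the body, no re-parse
        case (Changed(etag, bytes), _) =>
          val entry = Entry(etag, parse(bytes)) // changed: parse once and remember the ETag
          entries.put(key, entry)
          entry.value
        case (NotModified, None) =>
          throw new IllegalStateException(s"Store reported 'not modified' for uncached key $key")
      }
    }
  }
}
```

Because the store is still consulted on every `get`, the value stays current, while the expensive parse only happens when the ETag differs from the one held.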
ETag Caching was introduced for Facia `PressedPage` JSON downloading with #26338 in order to improve scalability and address #26335, but a limiting factor was the number of `PressedPage` objects that could be stored in the cache. With a max `PressedPage` size of 22MB and a memory budget of 4GB, a cautious max cache size limit of only 180 `PressedPage` objects was set. As a result, the cache hit rate was relatively low, and we saw elevated GC, probably because of objects continually being evicted from the small cache: #26338 (comment)

The change in this new commit dramatically reduces the combined size of the `PressedPage` objects held in memory, taking the average retained size per `PressedPage` from 4MB to 0.5MB (based on a sample of 125 `PressedPage` objects held in memory at the same time). It does this by deduplicating the `Tag` objects held by the `PressedPage`s. Previously, as the `Tag`s for different `PressedPage`s were deserialised from JSON, many identical tags would be created over and over again, and held in memory. After deduplication, those different `PressedPage`s will all reference the _same_ `Tag` object for a given tag.

The deduplication is done as the `Tag`s are deserialised - a new cache (gotta love caches!) holds `Tag`s keyed by their hashcode and tag id, and if a new `Tag` is created with a matching key, it's thrown away, and the old one is used instead. Thus we end up with just one instance of that `Tag`, instead of many duplicated ones.

See also:

* https://en.wikipedia.org/wiki/String_interning - a similar technique used by Java for Strings: https://www.geeksforgeeks.org/interning-of-string/
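The deduplication described above resembles interning, and might be sketched like this. It is an illustration rather than the code in the commit: `Tag` here is a simplified, hypothetical stand-in for `model.Tag`, and a plain unbounded `ConcurrentHashMap` stands in for the cache, whose real size limits and eviction are not shown.

```scala
import java.util.concurrent.ConcurrentHashMap

object TagInterningSketch {
  // Simplified, hypothetical stand-in for the real `model.Tag`.
  final case class Tag(id: String, webTitle: String)

  // Canonical instances, keyed by (hashcode, tag id) as the commit message describes.
  private val pool = new ConcurrentHashMap[(Int, String), Tag]()

  // Returns the first instance seen for this key; a later duplicate is discarded
  // by the caller simply using the returned value instead of its own.
  def intern(tag: Tag): Tag = {
    val existing = pool.putIfAbsent((tag.hashCode(), tag.id), tag)
    if (existing == null) tag else existing
  }
}
```

Calling `intern` on each `Tag` as it is deserialised means equal tags from different `PressedPage`s end up as references to one shared object, so the duplicates become garbage immediately rather than being retained by the cache.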
There have been several recent incidents (13/07/23: retro, timeline) where the serving of fronts (eg https://www.theguardian.com/uk) to users became slow, and in fact failed.

There are a lot of things going on in these incidents, but in particular we do see these `TooManyOperationsInProgress` errors in the logs.

The code in `FrontJsonFapi` where this issue is occurring is interesting, because it has two implementation details that are worth revisiting: it's non-async, and it uses a semaphore to put a hard limit on how many operations can occur concurrently.

frontend/common/app/services/fronts/FrontJsonFapi.scala
Lines 31 to 60 in 3ad37c6

* `futureSemaphore`, limiting the number of concurrent operations to 10 per server
* `blockingOperations.executeBlocking` to `FrontJsonFapi` - non-async, thread-blocking code
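For readers unfamiliar with the pattern, a semaphore that puts a hard limit on concurrent `Future`s can be sketched as below. This is an illustration under assumptions, not the frontend implementation: the class and exception are hypothetical re-creations named after the identifiers mentioned in this issue, and the real behaviour may differ.

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

object FutureSemaphoreSketch {
  final class TooManyOperationsInProgress extends Exception("Too many operations in progress")

  final class FutureSemaphore(maxOperations: Int) {
    private val permits = new Semaphore(maxOperations)

    // Starts `task` only if a permit is free; otherwise fails immediately instead of queueing.
    def execute[A](task: => Future[A])(implicit ec: ExecutionContext): Future[A] =
      if (permits.tryAcquire()) {
        // Try(...) captures a synchronous throw from `task`, so the permit is always released.
        Future.fromTry(Try(task)).flatten.andThen { case _ => permits.release() }
      } else {
        Future.failed(new TooManyOperationsInProgress)
      }
  }
}
```

With a limit of 10 per server, an eleventh concurrent call fails at once. If each permitted operation also blocks a thread while it waits on S3, slow downloads hold permits for longer, which would make rejections more likely under load.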