"Too many operations in progress" in FrontJsonFapi fetching #26335

Closed
rtyley opened this issue Jul 20, 2023 · 2 comments · Fixed by #26338
@rtyley
Member

rtyley commented Jul 20, 2023

There have been several recent incidents (13/07/23: retro, timeline) where the serving of fronts (e.g. https://www.theguardian.com/uk) to users became slow, and in fact failed.

There are a lot of things going on in these incidents, but in particular we see these `TooManyOperationsInProgress` errors in the logs:

```
Failed rendering world with concurrent.FutureSemaphore$TooManyOperationsInProgress: Too many operations in progress, cannot execute task
concurrent.FutureSemaphore$TooManyOperationsInProgress: Too many operations in progress, cannot execute task
	at concurrent.FutureSemaphore.execute(FutureSemaphore.scala:16)
	at services.fronts.FrontJsonFapi.parsePressedPage(FrontJsonFapi.scala:34)
	at services.fronts.FrontJsonFapi.$anonfun$pressedPageFromS3$2(FrontJsonFapi.scala:58)
```


The code in `FrontJsonFapi` where this issue occurs is interesting because it has two implementation details worth revisiting: it's non-async, and it uses a semaphore to put a hard limit on how many operations can run concurrently.

```scala
private def parsePressedPage(
    jsonStringOpt: Option[String],
)(implicit executionContext: ExecutionContext): Future[Option[PressedPage]] =
  futureSemaphore.execute {
    blockingOperations.executeBlocking {
      jsonStringOpt.map { jsonString =>
        DurationMetric.withMetrics(FaciaPressMetrics.FrontDecodingLatency) {
          // This operation is run in the thread pool since it is very CPU intensive
          Json.parse(jsonString).as[PressedPage]
        }
      }
    }
  }

private def loadPressedPageFromS3(path: String) =
  blockingOperations.executeBlocking {
    DurationMetric.withMetrics(FaciaPressMetrics.FrontDownloadLatency) {
      S3.getGzipped(path)
    }
  }

private def pressedPageFromS3(
    path: String,
)(implicit executionContext: ExecutionContext): Future[Option[PressedPage]] =
  errorLoggingF(s"FrontJsonFapi.pressedPageFromS3 $path") {
    for {
      s3FrontData <- loadPressedPageFromS3(path)
      pressedPage <- parsePressedPage(s3FrontData)
    } yield pressedPage
  }
```

georgeblahblah added a commit that referenced this issue Jul 20, 2023
We believe that the limit was too low. There are 2 instances of this
class sharing 128 threads, so it can probably be higher.

See #26335 and
#18331 for the original change.
rtyley added a commit that referenced this issue Jul 20, 2023
Context: #26335

In this change:

* Use AWS SDK v2 and non-blocking async code to avoid blocking a thread
  while downloading the Facia JSON data from S3
* Directly transform bytes into JsValue, without constructing a `String`
  first

Note that the work done between the two metrics `FrontDownloadLatency`
& `FrontDecodingLatency` has changed slightly - the conversion to the
basic JSON model (JsValue) now occurs in `FrontDownloadLatency`, rather
than the 'decoding' step.

We could get rid of the futureSemaphore and stop using the dedicated
blocking-operations pool here, but we'll leave that to another PR.

Co-authored-by: Ravi <[email protected]>
Co-authored-by: Ioanna Kokkini <[email protected]>
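
For illustration, here is a minimal sketch of that approach - a non-blocking S3 download with the AWS SDK v2 async client, parsing the gzipped bytes straight to a `JsValue`. The class and method names are illustrative rather than the actual code from #26338, and it assumes the whole object is buffered in memory before decompression:

```scala
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

import scala.concurrent.{ExecutionContext, Future}
import scala.jdk.FutureConverters._

import play.api.libs.json.{JsValue, Json}
import software.amazon.awssdk.core.async.AsyncResponseTransformer
import software.amazon.awssdk.services.s3.S3AsyncClient
import software.amazon.awssdk.services.s3.model.{GetObjectRequest, GetObjectResponse}

// Illustrative sketch (not the PR's code): fetch a pressed front from S3 without
// blocking a thread on I/O, then parse the gzipped bytes directly to JSON.
class AsyncFrontFetcherSketch(client: S3AsyncClient, bucket: String)(implicit ec: ExecutionContext) {

  def fetchJson(path: String): Future[JsValue] = {
    val request = GetObjectRequest.builder().bucket(bucket).key(path).build()
    client
      .getObject(request, AsyncResponseTransformer.toBytes[GetObjectResponse]())
      .asScala // CompletableFuture -> Scala Future; no thread is parked while S3 responds
      .map { responseBytes =>
        val gunzipped = new GZIPInputStream(new ByteArrayInputStream(responseBytes.asByteArray()))
        Json.parse(gunzipped) // play-json parses an InputStream directly, skipping the String step
      }
  }
}
```

In practice the decompression could also be streamed; the point here is only that the download itself no longer occupies a thread from the blocking-operations pool.
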
@rtyley
Member Author

rtyley commented Jul 27, 2023

We've had 2 TooManyOperationsInProgress outages recently (ELK logs):

  • 13th July: 12:35 - 12:55 (20 minutes) - 49,662 errors (peaking at 5,133 requests per minute) over just 356 paths. Initially there were 3 facia app-servers running (this was immediately after a deploy; facia scaled up to 12 over the course of the outage).

  • 19th July: 9:35 - 9:57 (22 minutes) - 58,577 errors (peaking at 4,618 requests per minute).

I've exported the data for these outages out of ELK, and am trying to understand what paths were being requested.

The rate of 5,133 errors per minute is really high, given that fronts have a CDN cache time of 1 minute and that only 356 different paths were being requested (or at least erroring) in the outage: the request rate is 5133/356 ≈ 14 times what an optimal request rate would be. In other words, if requests had been succeeding and getting cached, there would have been roughly 14x less traffic - but they were failing, throttled by `futureSemaphore`.

`parsePressedPage()` takes 140ms when healthy, meaning the upper limit for a single healthy app server, with the `futureSemaphore` 32-concurrent-actions limit, is ~13,700 operations per minute. During the outage, `parsePressedPage()` took 1,937ms as it contended for CPU with GC, giving an upper limit of ~990 per minute - well below the incoming request rate. It couldn't catch up without at least 6 app servers running.
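
For clarity, the arithmetic behind those limits (latency figures from the metrics above; the only assumption is the semaphore's 32-operation cap):

```scala
// Throughput ceiling imposed by a 32-operation semaphore, given per-operation latency.
object SemaphoreThroughputSketch extends App {
  val maxConcurrent = 32

  def opsPerMinute(latencyMillis: Double): Double = maxConcurrent * (60000.0 / latencyMillis)

  println(f"healthy  (140ms):  ${opsPerMinute(140)}%.0f ops/min")  // ~13,714
  println(f"degraded (1937ms): ${opsPerMinute(1937)}%.0f ops/min") // ~991, below the ~5,133/min error peak
}
```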

@rtyley
Member Author

rtyley commented Jul 28, 2023

I described a possible failure mode for Facia here: #26336 (comment) - basically, `futureSemaphore` would throw away 'download' work that had already been done (i.e. downloading entire S3 files to Strings in memory) because of its max-32-concurrent-actions limit on decoding, rather than holding on to that work and completing it so that a successful result could be returned to the CDN and cached. The lack of CDN caching meant that many requests were repeated, and this probably led to increased GC pressure.
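
The frontend's `FutureSemaphore` isn't reproduced here, but the failure mode depends only on one property visible in the stack trace: once the in-flight count exceeds the limit, work fails fast with `TooManyOperationsInProgress` rather than being queued. A rough sketch of that behaviour (the implementation details are illustrative, not the actual class):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, Future}

// Rough sketch of a reject-on-limit semaphore for Futures. When the limit is hit,
// the task is rejected outright - so an already-downloaded S3 payload waiting to be
// decoded is simply discarded rather than queued.
class FutureSemaphoreSketch(maxOperations: Int) {
  case class TooManyOperationsInProgress(msg: String) extends RuntimeException(msg)

  private val inFlight = new AtomicInteger(0)

  def execute[T](task: => Future[T])(implicit ec: ExecutionContext): Future[T] =
    if (inFlight.incrementAndGet() <= maxOperations) {
      val result = Future(task).flatten
      result.onComplete(_ => inFlight.decrementAndGet())
      result
    } else {
      inFlight.decrementAndGet()
      Future.failed(TooManyOperationsInProgress("Too many operations in progress, cannot execute task"))
    }
}
```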

This graph shows how many downloads vs decodes Facia was doing. When the server was healthy, these were the same number; when things went wrong, you can see we were doing many more downloads than decodes - all the extra download work achieved was generating more garbage (and so more GC) and consuming threads from the `blockingOperations` thread pool.

*(graph: downloads vs decodes over time)*

sophie-macmillan pushed a commit that referenced this issue Aug 9, 2023
* Switch to async AWS SDK v2 for Facia JSON download

Context: #26335

In this change:

* Use AWS SDK v2 and non-blocking async code to avoid blocking a thread
  while downloading the Facia JSON data from S3
* Directly transform bytes into JsValue, without constructing a `String`
  first

Note that the work done between the two metrics `FrontDownloadLatency`
& `FrontDecodingLatency` has changed slightly - the conversion to the
basic JSON model (JsValue) now occurs in `FrontDownloadLatency`, rather
than the 'decoding' step.

We could get rid of the futureSemaphore and stop using the dedicated
blocking-operations pool here, but we'll leave that to another PR.

Co-authored-by: Ravi <[email protected]>
Co-authored-by: Ioanna Kokkini <[email protected]>

* Reduce CPU & network consumption of Facia JSON download

This change introduces ETag-based caching with the new
https://github.com/guardian/etag-caching library, yielding
these benefits:

* Significant savings in terms of resources:
  * **Network**: `PressedPage`s are only re-downloaded if the S3 content has _changed_
  * **CPU**: Cached `PressedPage`s are only re-parsed if the S3 content has _changed_
* Retains the 'stay current' behaviour of the old, non-caching solution: S3 is queried with every request to ensure the ETag is up-to-date (sketched below).
* Also, given the cache is based on the ([S](https://github.com/blemale/scaffeine))[Caffeine](https://github.com/ben-manes/caffeine) caching library:
  * In-flight requests for a given key are unified, so 100 simultaneous requests for a key make just **1** fetch-and-parse
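
The core idea behind that caching is sketched below in plain Scala (this is not the etag-caching library's actual API - the names are illustrative): cache the parsed value keyed by path alongside the ETag it came from, re-check the ETag cheaply on every request, and only re-download and re-parse when it has changed.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

// Illustrative ETag cache (not the etag-caching library's API): parsed values are
// reused for as long as the S3 object's ETag is unchanged.
class ETagCacheSketch[V](
    fetchETag: String => Future[String],          // cheap check, e.g. an S3 HEAD request
    fetchAndParse: String => Future[(String, V)], // expensive: download + parse, returns (etag, value)
)(implicit ec: ExecutionContext) {

  private val cache = TrieMap.empty[String, (String, V)] // path -> (etag, parsed value)

  def get(path: String): Future[V] =
    fetchETag(path).flatMap { currentETag =>
      cache.get(path) match {
        case Some((cachedETag, value)) if cachedETag == currentETag =>
          Future.successful(value) // content unchanged: no re-download, no re-parse
        case _ =>
          fetchAndParse(path).map { case (etag, value) =>
            cache.put(path, (etag, value))
            value
          }
      }
    }
}
```

The real library also coalesces concurrent requests for the same key (via Caffeine) and bounds the cache size; this sketch omits both.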

Currently `facia` runs on [`c6g.2xlarge`](https://instances.vantage.sh/aws/ec2/c6g.2xlarge) instances (16GB RAM), with a JVM heap of 12GB. I'm going to suggest that it's reasonable for the `ETagCache` to consume up to 4GB of that 12GB heap.

There are ~300 fronts, each of which can have [4 different variants](https://github.com/guardian/frontend/blob/f64c5b681f53a9c87ae1e76c529d08b3ad16ef6b/common/app/model/PressedPage.scala#L68-L86), so currently there could be up to 1200 different `PressedPage` objects. From [heapdump](https://drive.google.com/file/d/1yjfqLMpWqww6-3L8RBN0yrAf9hQHHpSC/view?usp=sharing) analysis, the largest `PressedPage` retains ~22MB of memory (most instances are smaller, averaging 4MB). With the 4GB budget, and assuming a worst case of 22MB per `PressedPage`, we can afford to set a max size on the cache of ~180 entries. Although it's disappointing we can't hold _all_ of the `PressedPage`s, the [priorities of the eviction policy used by the Caffeine caching library](https://github.com/ben-manes/caffeine/wiki/Efficiency) should ensure that we get a good hit rate on the most in-demand fronts.
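
The sizing works out roughly as follows (figures from the heapdump analysis above):

```scala
// How many worst-case PressedPages fit in the 4GB cache budget?
val budgetMB = 4 * 1024       // 4GB of the 12GB heap earmarked for the cache
val worstCaseEntryMB = 22     // largest observed retained size of a PressedPage
val maxEntries = budgetMB / worstCaseEntryMB // = 186, hence a cautious max size of ~180
```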

_Incidentally, heapdump analysis also shows that some structures within `PressedPage` are memory-inefficient when considered from the perspective of holding many `PressedPage` objects in memory at once - using object pooling on `model.Tag`, for instance, would probably lead to an 80% reduction in total retained memory._

---------

Co-authored-by: George B <[email protected]>
Co-authored-by: Ravi <[email protected]>
Co-authored-by: Ioanna Kokkini <[email protected]>
rtyley added a commit that referenced this issue Aug 10, 2023
ETag Caching was introduced for Facia `PressedPage` JSON downloading
with #26338 in order to improve
scalability and address #26335,
but a limiting factor was the number of `PressedPage` objects that could
be stored in the cache.

With a max `PressedPage` size of 22MB and a memory budget of 4GB, a
cautious max cache size limit of only 180 `PressedPage` objects was set.
As a result, the cache hit rate was relatively low, and we saw elevated GC,
probably because of objects continually being evicted from the small cache:

#26338 (comment)

The change in this new commit dramatically reduces the combined size of
the `PressedPage` objects held in memory, taking the average retained size
per `PressedPage` from 4MB to 0.5MB (based on a sample of 125 `PressedPage`
objects held in memory at the same time).

It does this by deduplicating the `Tag` objects held by the `PressedPage`s.
Previously, as the `Tag`s for different `PressedPage`s were deserialised
from JSON, many identical tags would be created over and over again, and held
in memory. After deduplication, those different `PressedPage`s will all
reference the _same_ `Tag` object for a given tag.

The deduplication is done as the `Tag`s are deserialised - a new cache
(gotta love caches!) holds `Tag`s keyed by their hashcode and tag id,
and if a new `Tag` is created with a matching key, it's thrown away, and
the old one is used instead. Thus we end up with just one instance of
that `Tag`, instead of many duplicated ones.
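
A minimal sketch of that interning idea (the class and wiring are illustrative; the real change hooks into the JSON deserialisation of `model.Tag`):

```scala
import scala.collection.concurrent.TrieMap

// Illustrative interner: keyed by (hashCode, id), a freshly-deserialised value is
// discarded if an identical one has already been seen, so many PressedPages end up
// sharing a single instance for each tag.
class Interner[A](idOf: A => String) {
  private val cache = TrieMap.empty[(Int, String), A]

  def intern(fresh: A): A = cache.getOrElseUpdate((fresh.hashCode, idOf(fresh)), fresh)
}

// Hypothetical wiring at deserialisation time, assuming Tag exposes an `id` field:
//   val tagInterner = new Interner[Tag](_.id)
//   implicit val interningTagReads: Reads[Tag] = originalTagReads.map(tagInterner.intern)
```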

See also:

* https://en.wikipedia.org/wiki/String_interning - a similar technique
  used by Java for Strings: https://www.geeksforgeeks.org/interning-of-string/