
feat(sql): Optimize executions #4804

Open · wants to merge 7 commits into master from optimize_executions

Conversation

@kirangodishala (Contributor) commented on Nov 24, 2024:

This PR improves performance when using SQL as the backend by optimizing how results are obtained for the API call /applications/{application}/pipelines?expand=false&limit=2, which Deck initiates frequently.

Deck makes that call to Gate, and Gate forwards it to Orca's /v2/applications/{application}/pipelines endpoint.

Orca calls Front50 to get the pipeline and strategy config ids, then merges them. For each id in the merged list, it queries its own SQL database, applying the query parameters such as limit=2 and the status filter. These per-id database operations have been observed to take a long time, and this PR fixes the slowness by optimizing the queries to the db.

Refactored TaskController to support externalized config properties

Optimize getPipelinesForApplication() by reducing the number of relevant pipeline config ids to query

Further optimize getPipelinesForApplication() by processing multiple pipeline config ids at a time and multi-threading each config id batch (a sketch follows below)
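
A minimal sketch of that batching idea (the names, defaults, and the queryBatch helper are assumptions for illustration, not the PR's actual code):

import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Sketch only: partition the pipeline config ids, query each batch on a
// dedicated pool, and bound each batch with a timeout so one slow query
// cannot hang the whole request.
fun retrieveIdsInBatches(
    configIds: List<String>,
    batchSize: Int = 150,       // cf. max-number-of-pipeline-executions-to-process
    threads: Int = 10,          // cf. max-execution-retrieval-threads
    timeoutSeconds: Long = 60,  // cf. execution-retrieval-timeout-seconds (value assumed)
    queryBatch: (List<String>) -> List<String>  // the actual DB call, elided here
): List<String> {
  val executor = Executors.newFixedThreadPool(threads)
  try {
    return configIds
        .chunked(batchSize)
        .map { batch -> executor.submit<List<String>> { queryBatch(batch) } }
        .flatMap { it.get(timeoutSeconds, TimeUnit.SECONDS) }
  } finally {
    executor.shutdown()  // the pool is per-request, so always release it
  }
}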

Breaking changes:

These configuration properties

tasks:
   days-of-execution-history:  
   number-of-old-pipeline-executions-to-include: 

are replaced by the following configuration, along with a few newly added properties:

tasks:
   controller:
      days-of-execution-history: <int>
      number-of-old-pipeline-executions-to-include: <int>

      optimize-execution-retrieval: <boolean>
      max-execution-retrieval-threads: <int>
      max-number-of-pipeline-executions-to-process: <int>
      execution-retrieval-timeout-seconds: <int>
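
For reference, a filled-in example (values are illustrative; per this review thread the optimization is off by default, with 10 retrieval threads and 150 executions per batch — the remaining numbers are assumptions, not documented defaults):

tasks:
   controller:
      days-of-execution-history: 14                       # assumed value
      number-of-old-pipeline-executions-to-include: 2     # assumed value
      optimize-execution-retrieval: true                  # default: false
      max-execution-retrieval-threads: 10                 # default per this thread
      max-number-of-pipeline-executions-to-process: 150   # default per this thread
      execution-retrieval-timeout-seconds: 60             # assumed value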

@kirangodishala changed the title from "Optimize executions" to "feat(sql): Optimize executions" on Nov 25, 2024
@kirangodishala force-pushed the optimize_executions branch 2 times, most recently from fb4fd18 to 40fd05d on November 28, 2024
@Nonnull pipelineConfigIds: List<String>,
@Nonnull criteria: ExecutionCriteria
): List<String> {
return primary.retrieveAndFilterPipelineExecutionIdsForApplication(application, pipelineConfigIds, criteria) +
Member:
MINOR, BUT... this ordering DOES matter on the selection criteria, and I don't think we have a test for this (aka primary MUST come first, or it'd potentially select previous data). Just an observation that this is missing a test... but not sure how "critical" that test really is.
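
A sketch of such a test (assumes JUnit 5 and mockito-kotlin; the constructor shape and criteria setup here are hypothetical):

import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import org.mockito.kotlin.any
import org.mockito.kotlin.eq
import org.mockito.kotlin.mock
import org.mockito.kotlin.whenever

class DualExecutionRepositoryOrderingTest {
  @Test
  fun `primary ids come before previous ids`() {
    val primary = mock<ExecutionRepository>()
    val previous = mock<ExecutionRepository>()
    whenever(primary.retrieveAndFilterPipelineExecutionIdsForApplication(eq("app"), any(), any()))
        .thenReturn(listOf("new-1", "new-2"))
    whenever(previous.retrieveAndFilterPipelineExecutionIdsForApplication(eq("app"), any(), any()))
        .thenReturn(listOf("old-1"))

    val repo = DualExecutionRepository(primary, previous)  // constructor shape assumed

    val ids = repo.retrieveAndFilterPipelineExecutionIdsForApplication(
        "app", listOf("cfg-1"), ExecutionRepository.ExecutionCriteria())

    // primary MUST come first, or older data could win the selection criteria
    assertEquals(listOf("new-1", "new-2", "old-1"), ids)
  }
}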

var baseQueryPredicate = field("application").eq(application)
.and(field("config_id").`in`(*pipelineConfigIds.toTypedArray()))

var table = if (jooq.dialect() == SQLDialect.MYSQL) PIPELINE.tableName.forceIndex("pipeline_application_idx")
Member:
Question: DO we really want to force an index, or is another index on the combination needed? Or perhaps merge a couple of these indexes. NOT sure of the best solution, but... we have a number of indexes with some of this data, and I wonder if we couldn't do this a touch cleaner. I'd EXPECT the optimizer to handle MOST of this without the need to force an index.
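
For illustration only (orca manages its schema elsewhere, and the index and column names here are assumptions): the kind of composite index the comment hints at, covering both predicates of the query above so the optimizer can choose it without forceIndex:

import org.jooq.DSLContext
import org.jooq.impl.DSL.field
import org.jooq.impl.DSL.name
import org.jooq.impl.DSL.table

// One index on (application, config_id) instead of forcing a
// single-column index at query time.
fun createCompositeIndex(jooq: DSLContext) {
  jooq.createIndexIfNotExists("pipeline_application_config_idx")
      .on(table(name("pipelines")), field(name("application")), field(name("config_id")))
      .execute()
}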


// need to define a new executor service since we want a dedicated set of threads to be made available for every
// new request for performance reasons
ExecutorService executorService = Executors.newFixedThreadPool(
Member:
I gotta admit that having this in the request path, generating a new thread pool PER request, seems... dangerous, though a bigger question might be what's the downstream impact on the DB, and the impact on running concurrent executions on the DB? Since these are all hitting the RW pod as a rule right now, I could see this adding a tremendous amount of load on the database as a result...

Contributor:

It's not quite a thread-per-request since the number of threads is set by a config prop (default 10).

#4803 comes next, with more to follow as we move requests to a read-only database replica.

Member:
Yeah, saw the properties, but generating a thread pool on the request, even WITH limits, is still backed by a queue... It does get shut down, but this could absolutely hammer a database badly because of a potential exponential call graph.
E.g. a browser can repeat calls, which would then make new calls on timeouts, and that'd overload this badly. The RO PR should help A LOT with that, however, and IN THEORY this should improve performance. It's just a concern based on how I've seen these APIs get called.

Contributor:
Yup. We also ended up making changes in deck to improve things. Those are gonna be harder for us to upstream, but we'll try.

Contributor:
And yes, massive pipelines cause lots of load.

* <p>It can be further tuned, depending on your setup, since 150 executions work well for some
* applications but a higher number may be appropriate for others.
*/
int maxNumberOfPipelineExecutionsToProcess = 150;
Member:
A concern is this is probably TOO high depending upon capacity for "smaller" installations - as is probably the max execution threads. IF you allow for 3 concurrent API calls at a time, that's potentially 30 separate threads spawned and hitting the database. That's a LOT of concurrency against the database & database connections & could impact pipeline operations as a result. And you don't want to make your connection pools TOO big (beyond CPU capacity) or it actually gets WORSE.

Contributor (Author):
This is a configurable value and matters only when optimizeExecutionRetrieval = true, which is false by default. The idea is that this feature is enabled only when the number of pipelines and/or executions is high.

Contributor:
It would still be nice if the defaults were useful for more folks. What do you suggest here, @jasonmcintosh?

Member:
*ponder* So NORMAL recommendations on pool/concurrency size are tied to CPUs. I think most people run on smaller instances with limited cores, and with other background processes... I'd just go to smaller defaults? Probably maxExecutionRetrievalThreads set to 4, maybe. *shrug* It's a hard one to judge - an m5.2xlarge has only 8 CPUs (and I'll ignore hyperthreading), so you could hit some concurrency issues. It's not IMPOSSIBLE... I just tend to vote "smaller, more of them" kinda stuff on this.
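
Spelled out as config, the suggestion above would look something like this (illustrative only, not a tested recommendation):

tasks:
   controller:
      optimize-execution-retrieval: true
      max-execution-retrieval-threads: 4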

@christosarvanitis (Member):
The whole idea seems like a good optimisation to me 👍 @dbyron-sf @kirangodishala how long have you been running this in your forks?

Also, could we fit the same concept of optimisation into retrievePipelinesForPipelineConfigIdsBetweenBuildTimeBoundary in this PR?

.queryTimeout(queryTimeoutSeconds) // add an explicit timeout so that the query doesn't run forever
.fetch()

log.info("getting stage information for all the executions found so far")
Contributor:
This doesn't seem valuable enough to leave at info.

Suggested change:
- log.info("getting stage information for all the executions found so far")
+ log.debug("getting stage information for all the executions found so far")

if (!expand) {
unexpandPipelineExecutions(allPipelines)
log.info("unexpanding pipeline executions")
Contributor:
Suggested change:
- log.info("unexpanding pipeline executions")
+ log.debug("unexpanding pipeline executions")

}

return filterPipelinesByHistoryCutoff(allPipelines, limit)
log.info("filtering pipelines by history")
Contributor:
Suggested change:
- log.info("filtering pipelines by history")
+ log.debug("filtering pipelines by history")

kirangodishala and others added 7 commits December 13, 2024 22:12

refactor TaskController to support externalized config properties

optimize getPipelinesForApplication() by reducing the number of relevant pipeline config ids to query

This is a breaking change:
TaskController previously looked at these properties:
`tasks.days-of-execution-history`
`tasks.number-of-old-pipeline-executions-to-include`
and now they live in `tasks.controller`.

fix(web): give java code access to groovy
to fix the below error:

> Task :orca-web:compileJava FAILED
../orca/orca-web/src/main/java/com/netflix/spinnaker/config/TaskControllerConfigurationProperties.java:19: error: cannot find symbol
import com.netflix.spinnaker.orca.controllers.TaskController;
                                             ^
  symbol:   class TaskController
  location: package com.netflix.spinnaker.orca.controllers
1 error
optimize getPipelinesForApplication() further by processing multiple pipeline config ids at a time and multi-threading each config id batch
refactor function and variable names to make them more descriptive