
Reduce API response times by 30% by using memcache migration flag #5239

Merged

Conversation

AlanCoding
Member

SUMMARY

When profiling request times (or just looking at the debug toolbar output), the migration middleware came up as a big resource consumer. It was tricky to pin down, because the cost wasn't noticeable when tested with isolated requests.

I ran timings with the debug settings off, just logged middleware times, and exercised some typical UI use. In general, ~40% of request time was spent in middleware, and ~30% was spent in the migration middleware alone.

This is not hard to reduce, but this PR still needs more testing via the means above and verification that functionality is unchanged.

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
9.0.1
ADDITIONAL INFORMATION

@ryanpetrello
Contributor

ryanpetrello commented Nov 5, 2019

~30% was spent in the migration middleware alone.

Do you have some examples? When we talk about 30%, what is the actual total time saved in seconds on a fairly naive request (i.e., are we talking hundredths of a second per request, or tenths)? Are we talking about requests that don't otherwise have database access (e.g., GET /api/v2/)?

return redirect(reverse("ui:migrations_notran"))
if cache.get('migration_in_progress', False):
executor = MigrationExecutor(connection)
plan = executor.migration_plan(executor.loader.graph.leaf_nodes())

So I take it this is getting to be fairly expensive per-request?
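The change replaces that per-request migration plan computation with a cheap cache lookup. A minimal sketch of the pattern under discussion, using a dict-backed stand-in for Django's cache API rather than AWX's actual middleware (the helper names here are illustrative, not AWX code):

```python
class FakeCache:
    """Stand-in for Django's cache API (cache.get/set/delete)."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


cache = FakeCache()


def expensive_migration_plan():
    # Placeholder for MigrationExecutor(connection).migration_plan(...),
    # which inspects the migration graph and hits the database.
    return []


def needs_migration_screen():
    # Cheap path: a single cache lookup on the vast majority of requests.
    if not cache.get('migration_in_progress', False):
        return False
    # Slow path: only taken while the flag is raised.
    plan = expensive_migration_plan()
    if not plan:
        # Migrations are done; lower the flag so later requests
        # take the cheap path again.
        cache.delete('migration_in_progress')
        return False
    return True
```

The point of the pattern is that the expensive executor work only runs while the flag is raised, and the first request that finds an empty plan clears the flag.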

awx/main/apps.py Outdated
from django.utils.translation import ugettext_lazy as _


def raise_migration_flag(**kwargs):
from awx.main.tasks import set_migration_flag
set_migration_flag()

You probably actually want set_migration_flag.apply_async() so that it gets broadcast everywhere, right?
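The distinction the reviewer is drawing: a plain function call raises the flag only in the process handling the signal, while dispatching it as a task can broadcast it to every node. A toy sketch of that difference, where the queue, node names, and decorator-free task function are all stand-ins rather than AWX's real dispatch API:

```python
class FakeBroadcastQueue:
    """Stand-in for a broadcast task queue: apply_async 'delivers'
    the task to every node instead of running it only locally."""
    def __init__(self, nodes):
        self.nodes = nodes

    def apply_async(self, fn):
        for node in self.nodes:
            fn(node)


flags = {}  # which nodes have raised their migration flag


def set_migration_flag(node):
    # In the real system this would set the flag in that node's cache.
    flags[node] = True


queue = FakeBroadcastQueue(['awx-1', 'awx-2', 'awx-3'])


def raise_migration_flag(**kwargs):
    # Dispatching (rather than calling set_migration_flag() directly)
    # raises the flag on every node, not just the local one.
    queue.apply_async(set_migration_flag)
```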

@ryanpetrello left a comment


I'm interested in chatting about numbers, but I really like this idea in general if we can show that the migration planner is contributing notable overhead to certain requests.

@ryanpetrello
Contributor

[image attachment]

@softwarefactory-project-zuul
Contributor

Build failed.

@AlanCoding
Member Author

Oh, I forgot the other thing that's still TBD on this.

Hopefully reasonable statement: we should not run the task manager while migrations are running.

It's probably benign either way: right now the task manager will run and just roll back in the event of errors until the migration completes. But there's a reasonable user-experience case for logging "migration is running" instead of a stream of errors.

@softwarefactory-project-zuul
Contributor

Build succeeded.

@AlanCoding
Member Author

In spite of substantial hardware differences, I actually get almost the same number for that timing, as low as 6 seconds.

Here is the branch I used for the timings previously, which I just rebased:

https://github.com/ansible/awx/compare/devel...AlanCoding:middleware_timings?expand=1

I very consistently get higher numbers using that method. When I first did this, my number was 157 ms. I re-tested just now over a slightly longer time frame and got 200 ms. This sounds suspiciously high, I know.

This is why I was doing the more complicated task of bracketing the middleware methods: those numbers remained stubbornly higher than when tested in isolation. I cannot tell you exactly why; my speculation falls short of justifying the magnitude of the difference. Your 6-second number averages out to 63 ms per request, which isn't exactly small either, considering that it applies to all requests.
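For reference, "bracketing the middleware methods" can be done by wrapping each callable and accumulating its wall-clock time; this is a rough, generic sketch of the idea, not the code in the linked timing branch:

```python
import time
from collections import defaultdict

# Accumulated seconds per middleware name.
timings = defaultdict(float)


def timed(name, fn):
    """Wrap a middleware-style callable and accumulate its wall time."""
    def wrapper(request):
        start = time.perf_counter()
        try:
            return fn(request)
        finally:
            timings[name] += time.perf_counter() - start
    return wrapper


# Usage with a dummy innermost handler:
def inner(request):
    return "response"


handler = timed("MigrationRanCheckMiddleware", inner)
handler({"path": "/api/v2/"})
```

Logging the `timings` dict after a batch of requests gives per-middleware totals like the JSON output further down in this thread.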

@ryanpetrello
Contributor

ryanpetrello commented Nov 5, 2019

Even if we're not talking about saving multiple tenths of seconds, I do think this change is simple enough, and worth it. It's always bothered me that we rebuild the migration executor on every single request just for the purpose of showing an upgrade screen.

I like the idea of setting a flag in the cache once migrations start, and having the local instances unset their flag as requests land there and they discover migration is done.

@ryanpetrello
Contributor

Hopefully reasonable statement: we should not run the task manager while migrations are running.

I suppose it would be reasonable to add a check to the task manager's startup. This should definitely lean on the same memcached flag check, though.
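A sketch of what that startup check could look like, leaning on the same cache flag; the class shape and flag handling here are assumptions for illustration, not AWX's real TaskManager:

```python
import logging

logger = logging.getLogger('awx.main.scheduler')


class TaskManagerSketch:
    """Illustrative only: guard the scheduling cycle behind the same
    shared 'migration_in_progress' flag the middleware consults."""
    def __init__(self, cache):
        self.cache = cache  # any mapping with .get(), e.g. Django's cache
        self.ran_cycle = False

    def schedule(self):
        if self.cache.get('migration_in_progress', False):
            # Friendly log line instead of repeated rollback errors.
            logger.debug('Migration in progress; skipping task manager cycle')
            return
        self.ran_cycle = True  # placeholder for the real scheduling work
```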

@softwarefactory-project-zuul
Contributor

Build succeeded.

@AlanCoding AlanCoding marked this pull request as ready for review November 12, 2019 04:15
@softwarefactory-project-zuul
Contributor

Build succeeded.

@AlanCoding
Member Author

Did some basic manual testing. This shows the same behavior as we have now: if you migrate back one migration, the migration screen never goes away. I also observed the log messages related to the task manager.

Filed #5302 related to this.

@AlanCoding
Member Author

I combined this change with my timing instrumentation

https://github.com/ansible/awx/compare/devel...AlanCoding:migration_cache_verification?expand=1

and got some updated numbers.

{
  "total": 0.3692765372923051,
  "request": 1.916118051813937e-05,
  "MigrationRanCheckMiddleware_request": 0.019325396110271585,
  "ActivityStreamMiddleware_request": 0.006098179981626314,
  "URLModificationMiddleware_request": 4.861820703265311e-05,
  "SessionTimeoutMiddleware_request": 0.0005964273693917811,
  "URLModificationMiddleware": 1.2784168638032058e-05,
  "ActivityStreamMiddleware": 0.0014353072506257858
}

That's for 87 data points. At 0.02 seconds, this is not as good as I had hoped, but it is much improved over the 0.3 seconds I was seeing before with the same hardware and settings.

Anyway, I think I'm finished at this point. Checking these numbers was my last TODO item.

@AlanCoding
Member Author

@ryanpetrello are you in favor of merge?

@ryanpetrello
Contributor

Oh, I thought I merged this. Yep, let's merge it.

@softwarefactory-project-zuul
Contributor

Build failed (gate pipeline). For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@wenottingham
Contributor

regate

@wenottingham
Contributor

(needs rebase before merging)

@softwarefactory-project-zuul
Contributor

Build failed (gate pipeline). For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@softwarefactory-project-zuul
Contributor

Build succeeded.

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).
