Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Events: Optimize enrolls to avoid timeouts #396

Merged
merged 9 commits into from
Oct 21, 2024
Merged

Conversation

martastain
Copy link
Member

@martastain martastain commented Oct 18, 2024

Problems experienced with the enroll endpoint are related to the following:

  • The enroll operation itself is slow
  • When client request timeouts, the underlying operation is not cancelled. Meanwhile clients request the same operation again, which leads to a high load on the server.

What has been done to address the issue so far

  • Various query optimizations have been tried to speed up the enroll operation
  • The endpoint return 503 status code when the DB connection pool is reaching its limit to prevent lockup

What is the next step

Optimization

  • By default do not try to use events older than 3 days. Usually it does not make sense to use older events for processing.

This limit can be raised per request if needed by changing ignoreOlderThan field in the request (the value is integer - number of days). You may use 0 to remove the limit and process the entire queue

Preventing duplicate requests

We could utilize the fact the query continues on the background.

  • Keep track of the processed request and return 503 response if the same request is received again
  • When processing is finished save the response to the cache and return the cached response if the same request is received again.
  • Invalidate cached results based on TTL

Implementation

  1. When a request is received, calculate a hash of the request model.
  2. Check if the hash exists in redis. If it does, return 503 status code.
  3. If the hash does not exist, save the hash to redis with a value of {"status": "processing"} (TTL 1 hour)
  4. Process the request.
  5. If the result of the process is None, delete the key from redis
  6. If the client is still connected, return the response and delete the key from redis
  7. If client is disconnected resave the redis key: {"status": "done", "payload": EnrollResponseModel}

That way when the client re-request the same operation, we can return the cached response.

Testing

slothMode argument to enroll endpoint was introduced. When enabled, database request becomes super slow (using a sleep inside select query) to simulate slow responses. Additionaly, when this mode is used, enroll is more verbose and prints what's happening to the server log (prefixed with a friendly sloth emoji)

image

The following (or similar) script can be used to test the stability of the endpoint:

#!/usr/bin/env python3

import time
import requests

BASE_URL = "http://localhost:5000/api"
headers = {"X-API-KEY": "veryinsecurapikey"}

class EnrollException(Exception):
    pass

def do_enroll():
    data= {
        "sourceTopic": "server.started",
        "targetTopic": "test.target",
        "sender": "fakeservice",
        "slothMode": True,
    }
    try:
        res = requests.post(f"{BASE_URL}/enroll", json=data, headers=headers, timeout=2)
    except requests.exceptions.ReadTimeout:
        raise EnrollException("Timeout")
    except requests.exceptions.RequestException as e:
        raise EnrollException(f"HTTP Error {e}")
    r = res.json()
    if res.status_code == 503:
        raise EnrollException(r["detail"])
    elif res.status_code >= 400:
        raise EnrollException(f"API Error: {r['detail']}")
    return r

while True:
    print()
    print("Enrolling")
    try:
        res = do_enroll()
    except EnrollException as e:
        print(str(e))
        time.sleep(1)
        continue
    print("GOT RESULT", res)
    break

Underlying logic is unchanged, so the result is the same and changes are fully backwards compatible, but please, test with existing services.

@martastain martastain self-assigned this Oct 18, 2024
@martastain martastain added the type: bug Something isn't working label Oct 18, 2024
@martastain martastain merged commit cafabdd into develop Oct 21, 2024
@martastain martastain deleted the enroll-timeouts branch October 21, 2024 13:42
@tanh609
Copy link

tanh609 commented Oct 23, 2024

Thank you @martastain

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants