Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async: Implement circuit breaker for celery tooling #1830

Merged
merged 11 commits into from
Oct 30, 2017

Conversation

alukach
Copy link
Contributor

@alukach alukach commented Oct 3, 2017

Proposed changes in this pull request

The PR applies the circuit breaker pattern to the Export tooling. The idea is that if calls to an external service fails (in this case, if SQS is unreachable) consistently over specified threshold number of times, the circuit breaker enters an 'open' state where subsequent calls are blocked for a specified "cool-off" period. After the cool-off period, another call is permitted, the success of which will dictate if the circuit breaker should remain 'open' or should 'close' (allowing more failures). The idea is that the service will fail fast rather than tie up a number of threads/processes waiting for a non-responsive external service to time out. For more information, see Martin Fowler's blog post.

To achieve this pattern, we need to:

  • Pin our kombu requirement to a commit that includes this not-yet-released PR. This allows us to set a max number of retries for connecting to SQS, preventing the default behaviour of retrying indefinitely. This additionally requires us to add a max_retries configuration to our CELERY_BROKER_TRANSPORT_OPTIONS
  • Create a CircuitBreaker instance to manage connections to SQS. This is based off of the pybreaker library. State of the breaker is stored in our cache, ensuring that it is shared between many instances of the Cadasta Platform (such as the webserver instances spun up by uwsgi, the manage.py sync_tasks process, and multiple servers if we ever go down that route). If our cache is inaccessible (i.e. if memcached goes down or is unreachable for any reason), all circuit breakers are assumed to be 'open' (i.e. enabling communication to the service). When a circuit breaker changes state, it is logged at level error when opened and at level info when closed.
  • Wrap any code that attempts to make SQS network connection with our circuit breaker. This code should catch and pass on either any known exception that a down SQS network would throw or the CircuitBreakerError that is thrown when a breaker is open. To simplify the management of these errors, we've added a expected_errors attribute to the circuit breaker. This provides an object that can be used in the try/catch statement

Additionally, I've set up the system to:

  • Being that we can expect the manage.py sync_tasks command to be running in the background, we have a service that will be constantly attempting to connect to SQS. This means that the state of the circuit breaker should represent (in near-real-time) the accessibility of SQS. I've added an is_open property to our circuit breakers to return this value. The export button on the project dashboard is disabled when is_open returns False, providing the status of the export service to the end-user so as to have them avoid receiving a "export service down" error message.

When should this PR be merged

ASAP, as this is targeted to feature/async

How to test

To test this PR, I recommend developers:

  • Set up their environments with AWS keys (provided they have AWS SQS access)
  • Setup memcached by:
    • Installing memcached:
    sudo apt-get install memcached`
    • Setting the Cadasta-Platform to use memcached:
    CACHES = {
        'default': {
            'BACKEND': 'core.backends.MemorySafePyLibMCCache',
            'LOCATION': '127.0.0.1:11211',
        },
        'jsonattrs': {
            'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
            'LOCATION': 'jsonattrs'
        }
    }
  • Experiment with enabling and disabling the internet connections while and starting the Cadasta-Platform and scheduling export operations (note: remember to run the cadasta-platform with the SQS=1 environment flag). The following helpers make it easy to disable the internet connection on your VMs:
    disable-internet() {
      sudo /sbin/route del default
    }
    enable-internet() {
      sudo /sbin/route add default gw 10.0.2.2
    }

Risks

  • Being that the circuit breaker state is shared between processes/environments, if one environment/instance can't connect to SQS for some reason (e.g. bad AWS keys), this will open the circuit and block access to SQS for our entire system.

Follow-up actions

See #1400

Checklist (for reviewing)

General

  • Is this PR explained thoroughly? All code changes must be accounted for in the PR description.
    • Review 1
    • Review 2
  • Is the PR labeled correctly? It should have the migration label if a new migration is added.
    • Review 1
    • Review 2
  • Is the risk level assessment sufficient? The risks section should contain all risks that might be introduced with the PR and which actions we need to take to mitigate these risks. Possible risks are database migrations, new libraries that need to be installed or changes to deployment scripts.
    • Review 1
    • Review 2

Functionality

  • Are all requirements met? Compare implemented functionality with the requirements specification.
    • Review 1
    • Review 2
  • Does the UI work as expected? There should be no Javascript errors in the console; all resources should load. There should be no unexpected errors. Deliberately try to break the feature to find out if there are corner cases that are not handled.
    • Review 1
    • Review 2

Code

  • Do you fully understand the introduced changes to the code? If not ask for clarification, it might uncover ways to solve a problem in a more elegant and efficient way.
    • Review 1
    • Review 2
  • Does the PR introduce any inefficient database requests? Use the debug server to check for duplicate requests.
    • Review 1
    • Review 2
  • Are all necessary strings marked for translation? All strings that are exposed to users via the UI must be marked for translation.
    • Review 1
    • Review 2
  • Is the code documented sufficiently? Large and complex classes, functions or methods must be annotated with comments following our code-style guidelines.
    • Review 1
    • Review 2
  • Has the scalability of this change been evaluated?
    • Review 1
    • Review 2
  • Is there a maintenance plan in place?
    • Review 1
    • Review 2

Tests

  • Are there sufficient test cases? Ensure that all components are tested individually; models, forms, and serializers should be tested in isolation even if a test for a view covers these components.
    • Review 1
    • Review 2
  • If this is a bug fix, are tests for the issue in place? There must be a test case for the bug to ensure the issue won’t regress. Make sure that the tests break without the new code to fix the issue.
    • Review 1
    • Review 2
  • If this is a new feature or a significant change to an existing feature? has the manual testing spreadsheet been updated with instructions for manual testing?
    • Review 1
    • Review 2

Security

  • Confirm this PR doesn't commit any keys, passwords, tokens, usernames, or other secrets.
    • Review 1
    • Review 2
  • Are all UI and API inputs run through forms or serializers?
    • Review 1
    • Review 2
  • Are all external inputs validated and sanitized appropriately?
    • Review 1
    • Review 2
  • Does all branching logic have a default case?
    • Review 1
    • Review 2
  • Does this solution handle outliers and edge cases gracefully?
    • Review 1
    • Review 2
  • Are all external communications secured and restricted to SSL?
    • Review 1
    • Review 2

Documentation

  • Are changes to the UI documented in the platform docs? If this PR introduces new platform site functionality or changes existing ones, the changes must be documented in the Cadasta Platform Documentation.
    • Review 1
    • Review 2
  • Are changes to the API documented in the API docs? If this PR introduces new API functionality or changes existing ones, the changes must be documented in the API docs.
    • Review 1
    • Review 2
  • Are reusable components documented? If this PR introduces components that are relevant to other developers (for instance a mixin for a view or a generic form) they should be documented in the Wiki.
    • Review 1
    • Review 2

@alukach alukach closed this Oct 3, 2017
@alukach alukach changed the title IN PROGRESS: Async/circuit breaker Async: Implement circuit breaker for SQS service Oct 4, 2017
@alukach alukach reopened this Oct 4, 2017
@alukach alukach force-pushed the async/circuit-breaker branch from 15c3784 to bba5f7b Compare October 4, 2017 04:46
@alukach alukach changed the title Async: Implement circuit breaker for SQS service Async: Implement circuit breaker for celery tooling Oct 4, 2017
@alukach alukach force-pushed the async/circuit-breaker branch 2 times, most recently from edd3b06 to 5ca3772 Compare October 4, 2017 20:50
@alukach alukach mentioned this pull request Oct 5, 2017
14 tasks
Copy link
Member

@oliverroick oliverroick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have two comments re your docstrings. They don't seem to be fully up to date.


class CircuitBreakerCacheStorage(pybreaker.CircuitBreakerStorage):
"""
Defines the underlying storage for a circuit breaker - the underlying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring is not 100% clear. You say "the underlying implementation should be in a subclass that overrides the method this class defines" but CircuitBreakerCacheStorage but there's no subclass anywhere. So I'm assuming that you mean pybreaker.CircuitBreakerStorage needs a subclass, which you're implementing here. Is that correct?


def __init__(self, namespace, fallback_state=pybreaker.STATE_CLOSED):
"""
Creates a new instance with the given `state` and `redis` object. The
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here doesn't seem to be correct. It looks like the docstring for pybreaker. CircuitRedisStorage; can you update that?

@alukach alukach force-pushed the async/circuit-breaker branch from fb0671c to 5ca3772 Compare October 5, 2017 16:36
@alukach
Copy link
Contributor Author

alukach commented Oct 5, 2017

Nice work! I have two comments re your docstrings. They don't seem to be fully up to date.

You're completely correct, they should be resolved now. Also, when looking at this I realized that there was no logic to ensure that circuitbreakers would not have overlapping namespaces. I've added this in the last commit.

@alukach alukach force-pushed the async/circuit-breaker branch from 17439a9 to 51d59b9 Compare October 5, 2017 17:11
@alukach alukach force-pushed the async/circuit-breaker branch from 1f53b5d to 47a97d3 Compare October 17, 2017 00:33
@CLAassistant
Copy link

CLAassistant commented Oct 17, 2017

CLA assistant check
All committers have signed the CLA.

@alukach alukach force-pushed the async/circuit-breaker branch from e2f6628 to 0f7a99c Compare October 17, 2017 02:41
@alukach alukach force-pushed the async/circuit-breaker branch from 417bea8 to d0797b8 Compare October 18, 2017 01:25
Copy link
Contributor

@seav seav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the really long delay. I will admit that I did not fully understand everything because I don't have much familiarity with the packages used. But for the things that I could understand, there were no code smells or defects that I can see.

Note also that I wasn't able to actually test the circuit breaking function because I don't have AWS credentials.

@alukach alukach merged commit 9e81dd1 into feature/async Oct 30, 2017
alukach added a commit that referenced this pull request Nov 8, 2017
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
alukach added a commit that referenced this pull request Nov 13, 2017
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
alukach added a commit that referenced this pull request Nov 19, 2017
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
alukach added a commit that referenced this pull request Dec 6, 2017
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
@alukach alukach deleted the async/circuit-breaker branch December 6, 2017 19:06
alukach added a commit that referenced this pull request Dec 19, 2017
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
alukach added a commit that referenced this pull request Jan 3, 2018
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
alukach added a commit that referenced this pull request Jan 24, 2018
* Fixup docstring, ensure no duplicate namespace

* Defer queue setup to workertoolbox

* Add circuit breaker for celery tooling

* Cleanup Celery queue prefix settings

* Ensure all tasks are found

* Add sync task to vagrant machine

* Fixup

* Fixup

* Ensure logging helper is disabled

* Fix factory to add real task

* Update field type
alukach added a commit that referenced this pull request Jan 24, 2018
* Add backend-tooling for asynchronous tasks (#1624)

* Async: Configure DB and Cadasta Platform dev VM to support async workers (#1800)

* Refactor download form to schedule task (#1799)

* Async: Display export tasks on project dashboard (#1801)

* Async: Implement circuit breaker for celery tooling (#1830)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants