Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1710 dsr bulk reprocess not working with 20 demo #2015

Merged
merged 6 commits into from
Dec 13, 2022

Conversation

adamsachs
Copy link
Contributor

@adamsachs adamsachs commented Dec 9, 2022

Closes #1710

Code Changes

  • remove async keyword from privacy request endpoints, as it is one contributing reason that reprocessed privacy requests are being executed concurrently when running without a worker, which is not a state we expect/handle
  • update the order in which a reprocessed privacy request cache entry is cleared, to prevent requests that encounter another error on re-processing from getting into an unrecoverable state
  • make a small tweak to celery app instantiation to ensure worker properly picks up privacy requests

Steps to Confirm

The steps in the issue describe how to reproduce the problem while using fides deploy up. But we won't have the fix available in that workflow until we merge this PR, since it requires an update ethyca-fides package on pypi - so we're in a bit of a catch 22 there. That being said, I was able to reproduce the problem on all of my local testing environments (e.g. nox -s dev and nox -s test_env if I just updated my local fides.toml to have analytics_opt_out = false. So to confirm, you could

  • update fides.toml to have analytics_opt_out = false
  • run nox -s dev -- ui
  • run through the steps to repro listed in the issue

In general, the comments I left on the issue provide some context for different testing I did and my findings - they may be useful to read while evaluating this fix.

I'll also attach the latest version of the little test script (test_fides_1710.py.txt) I had that helped to reproduce the problem and verify the fix - the script uses fides dependencies so it needs to run in a python env with ethyca-fides installed. it's also not the most beautiful script - but it gets the job done :)

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation Updated:
    • documentation complete, or draft/outline provided (tag docs-team to complete/review on this branch)
    • documentation issue created (tag docs-team to complete issue separately)
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md

Description Of Changes

  • Removing the async keyword from the endpoint function doesn't get super close to the "root cause" of the problems seen in DSR reprocessing. If someone comes along and writes a new async endpoint function that queues privacy requests, the underlying infrastructure will still allow us to get into a state where we've got multiple concurrent DSR executions occurring when not using a worker, which is not something that will be handled well. That being said, this was by far the simplest and least intrusive fix I could think of, and it's effective - our current routes do not put us in a bad state, whether using a worker or not, with manual approval or not, when submitted or reprocessing, etc.
    • This change also comes with the added benefit of making these endpoints not clog up the whole server until they finish executing. This is a bit counterintuitive, because we're removing the async keyword. But because of the way that fastAPI works, when the endpoint functions were async, they were being executed on the app's main event loop; if the underlying work it was doing was not effectively leveraging async/await functionality - and most of it is not - then all of that work would clog up the entire web server. This did not impact manually approved privacy requests (since they were using a sync endpoint), and it didn't impact execution with a worker (since that work was successfully pushed off onto another process); but when running without a worker, and when bulk re-processing requests, or when creating a privacy request without manual approval required (both async endpoints), the whole server would lock up until the privacy requests finished execution! With this change, they no longer do that since the sync endpoints do all of their work on a separate thread/event loop than the one used by the main webserver
    • Note that we introduced some of these async endpoints in Make log send async fidesops#1174 while generally adding async functionality to the ops codebase to avoid errors with the fideslog client calls. The changes in this PR don't impact this, since what was needed there was for the celery task function to be async; we can call the async celery task from sync endpoints, as we do in this PR, due to the @sync wrapper we have.
  • The other change, that switches when the cache is cleared during privacy request re-execution, was not a direct fix for this ticket's bug, but it came up in some testing. Before this fix, privacy requests could effectively only be re-executed once, since their cached "failed checkpoint details" state was getting cleared after their re-execution ended. With this change, privacy requests should be able to be re-executed as many times as needed.

resolves issue where we were clearing the failed step cache AFTER the
execution for a privacy request reprocess, such that subsequent reprocess calls
could not find a failed step.
worker were not properly picking up privacy requests without this
@adamsachs adamsachs linked an issue Dec 9, 2022 that may be closed by this pull request
@adamsachs adamsachs marked this pull request as ready for review December 13, 2022 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DSR Bulk Reprocess not working with 2.0 Demo
2 participants