Migrate identity hashes #5256

Merged
galvana merged 11 commits into main from PROD-2633-migrate-identity-hashes
Sep 10, 2024

@galvana galvana commented Sep 4, 2024

Closes PROD-2633

⚠️ This PR contains a database migration

Description Of Changes

This PR migrates usages of the bcrypt hash to SHA-256 + salt for non-credential data. bcrypt is deliberately computationally expensive and should be reserved for passwords, client secrets, and other credentials. Using it to hash identity data such as emails and phone numbers was causing performance issues on every endpoint that hashes incoming identities, whether for saving or for searching.

At a high level the migration approach is:

  • Update any searches that rely on identity hashes to search using a bcrypt hash and a SHA-256 hash
  • Immediately start saving identities using SHA-256
  • Migrate select tables with bcrypt hashes to SHA-256 hashes
  • Once the migration is complete for a table, only search using the new SHA-256 hashes
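
A minimal sketch of the dual-hash lookup described in the steps above, using only the standard library. The names `sha256_hash_with_salt` and `candidate_hashes`, the choice to prepend the salt, and the `legacy_hash` callable (a stand-in for the existing bcrypt-based hash) are assumptions for illustration, not the code in this PR:

```python
import hashlib
from typing import Callable, List


def sha256_hash_with_salt(value: str, salt: str, encoding: str = "UTF-8") -> str:
    """Deterministic salted SHA-256 hash (salt placement is an assumption)."""
    return hashlib.sha256((salt + value).encode(encoding)).hexdigest()


def candidate_hashes(
    value: str,
    salt: str,
    legacy_hash: Callable[[str], str],
    is_migrated: bool,
) -> List[str]:
    """Return the hashes a search should match against.

    Until a table is fully migrated, rows may hold either hash, so the
    search has to try both. Afterwards only the SHA-256 hash is needed.
    """
    new_hash = sha256_hash_with_salt(value, salt)
    if is_migrated:
        return [new_hash]
    return [new_hash, legacy_hash(value)]
```

A query would then filter on something like `hashed_value IN (candidate_hashes(...))`, dropping the legacy hash once the table reports the migration as complete.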

Code Changes

  • Added an is_hash_migrated column to select tables to track the migration
  • Added an identity_table with a single encrypted value to store the system-wide identity salt
    • Accessed with a @cache decorated get_identity_salt function that also initializes the salt value on server startup if it doesn't exist
  • Updated filters in the consent and privacy request endpoints to use both the bcrypt and SHA-256 hashes so we can cut over as soon as the migrations are complete (see inline comments for details)
  • Migrated the existing hash_with_salt to hash_credential_with_salt (bcrypt) for passwords and hash_value_with_salt (SHA-256) for identity values
  • Added a HashMigrationMixin with shared migration logic
  • Added a bcrypt_migration_task job

Steps to Confirm

  • Check out main and create some privacy requests
  • Check out this branch and check the logs for Completed hash migration for...
  • Check the privacyrequest table and make sure every row has is_hash_migrated set to true

This is the easiest check, but this migration applies to:

  • CurrentPrivacyPreference
  • ProvidedIdentity
  • CustomPrivacyRequestField
  • PrivacyPreferenceHistory
  • ServedNoticeHistory
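
The per-table migration and the is_hash_migrated check can be pictured with in-memory rows. This is only a sketch: the `Row` dataclass, the `migrate_rows` and `all_migrated` helpers, and the assumption that the original value is recoverable for re-hashing are all hypothetical and not taken from the PR:

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    value: str  # assumed to be recoverable so it can be re-hashed
    hashed_value: str
    is_hash_migrated: bool = False


def sha256_with_salt(value: str, salt: str) -> str:
    return hashlib.sha256((salt + value).encode("UTF-8")).hexdigest()


def migrate_rows(rows: List[Row], salt: str) -> int:
    """Re-hash every unmigrated row and return how many were updated."""
    migrated = 0
    for row in rows:
        if row.is_hash_migrated:
            continue  # already on SHA-256, leave untouched
        row.hashed_value = sha256_with_salt(row.value, salt)
        row.is_hash_migrated = True
        migrated += 1
    return migrated


def all_migrated(rows: List[Row]) -> bool:
    """The confirmation step: no row may be left unmigrated."""
    return all(row.is_hash_migrated for row in rows)
```

Running `migrate_rows` a second time is a no-op, which is the property that makes it safe to re-run the job after an interrupted migration.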

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Issue Requirements are Met
  • Update CHANGELOG.md
  • If there are any database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works

cypress bot commented Sep 4, 2024

fides Run #9858

Run Properties:
Project: fides
Branch Review: refs/pull/5256/merge
Run status: Passed #9858
Run duration: 00m 35s
Commit: 49c6df2802 (Merge ed95f2c8f15bb34d8f38603cbe3114e6b3aacb89 into 8350397d800c0a865512c494e771...)
Committer: Adrian Galvan

Test results
Failures: 0
Flaky: 0
Pending: 0
Skipped: 0
Passing: 4

@galvana galvana requested a review from adamsachs September 5, 2024 00:21
@galvana galvana marked this pull request as ready for review September 5, 2024 00:21
Comment on lines +34 to +50
@classmethod
@abstractmethod
def bcrypt_hash_value(
    cls,
    value: MultiValue,
    encoding: str = "UTF-8",
) -> Optional[str]:
    """Hash value using bcrypt."""

@classmethod
@abstractmethod
def hash_value(
    cls,
    value: MultiValue,
    encoding: str = "UTF-8",
) -> Optional[str]:
    """Hash value using SHA-256."""
Contributor Author

I set these as abstract because I expect any class that implements the HashMigrationMixin to have these two functions. Not sure if there is a better way to do this

Contributor

that looks right - you may need/want to declare the mixin class as an ABC though?

Contributor Author

I tried adding it but I got this error

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Going to save further research into this until I address the other issues.
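
For reference, a sketch of how this kind of metaclass conflict arises and one common way around it. `ModelMeta` and `Base` are hypothetical stand-ins for an ORM's declarative metaclass and base class (presumably the source of the conflict here, though the thread does not confirm it), and `HashMixinABC` is an illustrative name rather than the mixin in this PR:

```python
from abc import ABC, ABCMeta, abstractmethod


class ModelMeta(type):
    """Hypothetical stand-in for an ORM's declarative metaclass."""


class Base(metaclass=ModelMeta):
    """Hypothetical stand-in for the ORM base class."""


class HashMixinABC(ABC):
    @classmethod
    @abstractmethod
    def hash_value(cls, value: str) -> str:
        """Hash value using SHA-256."""


# Combining the two bases directly reproduces the conflict, because
# neither ABCMeta nor ModelMeta is a subclass of the other.
try:
    class Broken(HashMixinABC, Base):
        pass
except TypeError as exc:
    conflict_message = str(exc)


class CombinedMeta(ModelMeta, ABCMeta):
    """Metaclass that is a subclass of both metaclasses in play."""


class Model(HashMixinABC, Base, metaclass=CombinedMeta):
    @classmethod
    def hash_value(cls, value: str) -> str:
        return value.upper()  # placeholder, not a real hash


class Incomplete(HashMixinABC, Base, metaclass=CombinedMeta):
    pass  # no hash_value, so instantiating it raises TypeError
```

Without ABC in the hierarchy, `@abstractmethod` is only documentation: nothing stops a subclass that omits the methods from being instantiated. A lighter alternative is to drop `@abstractmethod` and raise `NotImplementedError` in the mixin's default implementations.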

Contributor

@adamsachs adamsachs left a comment

@galvana this is looking quite good - a really nice implementation and an elegant way of handling the migration. very well documented throughout, and nicely designed - it's really easy to follow.

i've got some minor comments throughout, let me know what you think. i didn't test myself, but assume you've given it, or will give it, some rigorous manual testing, given how involved of a change it is.

i tried to leave some comments on places where we may want to mark code for eventual deprecation/removal since some of this is only needed temporarily by nature...but i certainly missed some spots. so that's just a general comment - it may be good taking a pass through and clearly marking which modules/classes/functions are just needed in an "interim" while we ensure migrations are completed....

)


@celery_app.task(base=DatabaseTask, bind=True)
Contributor

don't think this should be a celery task? you're not running it with celery, just our apscheduler task runner which doesn't interact with celery

Contributor Author

I was just following the pattern we had for other jobs, plus this has a nice way to get a database connection 🤷

Contributor

@adamsachs adamsachs Sep 9, 2024

heh, i do appreciate looking for consistency, but i'm pretty sure that this is rarely/never intended if the function isn't actually going to be run as a celery task.

as i look a bit closer into it, i suppose it's possible that this type of usage could be intentional, since we actually leverage a different db engine (our 'task engine') within the DatabaseTask.get_new_session...so maybe there's a use case where we'd want to use that engine, even if the functions aren't running as celery tasks? it feels a bit contrived - but i'd like to check with @pattisdr if that's what was intended when adding the decorator to e.g. this function in the DSR 3.0 refactor - or if that's also just accidental usage!

i do wanna get clear on this, because as your comment indicates, if we don't start clearing this up, it's only going to cause more confusion as we add more scheduled tasks!

Contributor

Too long ago to recall but I doubt it was intentional, just an easy way to get the session. I agree we're creating unnecessary confusion here by doing it like this.

Contributor

thanks @pattisdr -

in that case @galvana, i'd suggest we remove this. it should be pretty straightforward for us to get a well-handled db session - see an analogous example here in the discovery monitor executor (this function is also run as a scheduled task!). that's using a function defined in fidesplus to retrieve the db session but we can/should add a @contextmanager decorator to our deps.get_db() in fides OSS, i think, so it can be used similarly...

Contributor Author

Thank you two for the input! I've promoted get_db to a context manager, let's just make sure all the tests pass
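
A rough sketch of what using a context-managed session from a scheduled job could look like. `FakeSession` and `run_migration_job` are hypothetical stand-ins, and this `get_db` is an illustration of the pattern rather than the actual `deps.get_db()` implementation:

```python
from contextlib import contextmanager
from typing import Iterator


class FakeSession:
    """Hypothetical stand-in for a database session."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def get_db() -> Iterator[FakeSession]:
    """Yield a session and always close it, even if the job raises."""
    db = FakeSession()
    try:
        yield db
    finally:
        db.close()


def run_migration_job() -> FakeSession:
    """Plain function suitable for a scheduler; no Celery decorator needed."""
    with get_db() as db:
        # ... migration work would use `db` here ...
        return db
```

The session is closed as soon as the `with` block exits, so the job gets a well-handled session without being registered as a Celery task.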

Contributor Author

@galvana galvana left a comment

Thanks for the review @adamsachs. I addressed most of your concerns except for these two:

  • Still using a Celery task for the migration, I'm just sticking to the convention for our other jobs
  • I ran into conflicts adding ABC to HashMigrationMixin and decided it wasn't worth troubleshooting further at this point in the sprint

@galvana galvana requested a review from adamsachs September 9, 2024 17:12
Contributor

@adamsachs adamsachs left a comment

looks great @galvana , thanks for the extra care in testing some of my questions out! 👍

only lingering question for me is the celery decorator, don't need to hold up approval on that but i would like to get that resolved one way or another before we merge

@galvana galvana merged commit 84c8b1e into main Sep 10, 2024
14 checks passed
@galvana galvana deleted the PROD-2633-migrate-identity-hashes branch September 10, 2024 01:03
cypress bot commented Sep 10, 2024

fides Run #9859

Run Properties:
Project: fides
Branch Review: main
Run status: Passed #9859
Run duration: 00m 35s
Commit: 84c8b1e2e2 (Migrate identity hashes (#5256))
Committer: Adrian Galvan

Test results
Failures: 0
Flaky: 0
Pending: 0
Skipped: 0
Passing: 4
