
Tired: message.assignment_id; Wired: message.campaign_contact_id #1311

Merged
merged 13 commits into main
Jan 14, 2020

Conversation

schuyler1d
Collaborator

@schuyler1d schuyler1d commented Dec 18, 2019

move code from using message.assignment_id => message.campaign_contact_id; also adds message.messageservice_sid

TODO:

  • migration and schema defining the new indexes desired on message
  • migration documentation

Description

Schema changes for better scaling in the prototype scaling branch

Architecture Proposal

This creates a schema change: many places in the code joined on message.assignment_id = campaign_contact.assignment_id AND message.contact_number = campaign_contact.cell; those lookups now join directly on message.campaign_contact_id.
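
For illustration, a rough sketch of the join change (hypothetical queries; the table and column names are the ones used in this PR, but the actual queries in the code vary):

-- old pattern: indirect two-column join
SELECT message.*
FROM message
JOIN campaign_contact
  ON message.assignment_id = campaign_contact.assignment_id
 AND message.contact_number = campaign_contact.cell;

-- new pattern: direct join on the new foreign key
SELECT message.*
FROM message
JOIN campaign_contact
  ON message.campaign_contact_id = campaign_contact.id;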

Motivation

The old join pattern has some negative performance implications:

  • Updating assignments requires a bulk update (and lock) of the message table -- a costly and dangerous thing with high throughput
  • Large difficulties in creating a scaling structure that can look up records based on campaign_contact_id

There is precedent and experience with this approach in the Politics Rewired branch

Proposal Details

This varies slightly from the Politics Rewired structure. In ours, we also add a message.messageservice_sid column. This allows us to track something in the table that can participate in the lookup for an incoming message. In the future, orgs (and/or contacts) can have different messageservice_sids -- this can separate accounting aspects and also track different orgs texting the same contact. In Twilio, a messageservice_sid corresponds to the Message Service used. However, it could also be a 'from' number (and is for Nexmo in this PR) -- i.e. anything unique that is known before sending and is identifiable from an incoming message.
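
As a hypothetical sketch (not code from this PR), an incoming message could then be matched back to its contact with a lookup along these lines, which the new (contact_number, messageservice_sid) index on message is meant to support; the literal values are just examples:

SELECT campaign_contact_id
FROM message
WHERE contact_number = '+15555551234'        -- the incoming 'from' number (example value)
  AND messageservice_sid = 'MG_example_sid'  -- the Message Service (or Nexmo 'from' number) that received it (example value)
ORDER BY id DESC
LIMIT 1;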

Deferred/Future Work

  1. We do not yet create the multi-tenant algorithms or structure for looking up a messageservice_sid or a particular organization/setup.
  2. We do not yet add any caching changes that this work will make easier
  3. We do not address a discovered issue with Nexmo support (deprecated), which requires providing/supplying a phone number to send from in the current code base -- instead we just document it inline.

Checklist:

  • I have manually tested my changes on desktop and mobile
  • The test suite passes locally with my changes
  • [na] If my change is a UI change, I have attached a screenshot to the description section of this pull request
  • My change is 300 lines of code or less, or has a documented reason in the description why it’s longer
  • I have made any necessary changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • My PR is labeled [WIP] if it is in progress

@schuyler1d schuyler1d changed the title [WIP] Tired: message.assignment_id; Wired: message.campaign_contact_id Tired: message.assignment_id; Wired: message.campaign_contact_id Dec 20, 2019
strangeways previously approved these changes Dec 21, 2019
@ibrand
Collaborator

ibrand commented Dec 23, 2019

@schuyler1d noticing the test suite is failing here, do you know why?

@ibrand ibrand added the S-needs tests to pass label Dec 23, 2019
schuyler1d and others added 2 commits December 24, 2019 11:50
The deleted test was replaced by __test__/server/api/createOptOut.test.js
@schuyler1d
Collaborator Author

> @schuyler1d noticing the test suite is failing here, do you know why?

@ibrand the main test test/server/texter.test/correctOptOut.test.js has some loading issues (basically we should fully deprecate getGql in tests and move to runGql). Larry did this in this PR where he writes a better test with runGql

To fix this in my branch, I ran git cherry-pick 22e2d727e3d3ad3d1457160f0cdc96442bc50fc7 (which needs a remote add of Larry's fork) and pushed. I believe that will fix this in my branch and also work during the merge. 🤞

@schuyler1d
Collaborator Author

nope, it's something else:

FAIL __test__/server/api/lib/twilio.test.js
  ✕ postMessageSend success should save message and update contact state (269ms)
    error: duplicate key value violates unique constraint "pg_type_typname_nsp_index"
      at Connection.Object.<anonymous>.Connection.parseE (node_modules/pg/lib/connection.js:567:11)
      at Connection.Object.<anonymous>.Connection.parseMessage (node_modules/pg/lib/connection.js:391:17)
      at Socket.<anonymous> (node_modules/pg/lib/connection.js:129:22)

I think I see the issue (basically we need to use transactions in twilio.postMessageSave -- after trying to save 'just once').

@schuyler1d
Collaborator Author

ok, tests passing -- letting the default migrations stay transactional in all cases, which makes tests pass more easily; also, anyone with a bigger db to migrate would have to run these migrations manually anyway.

@schuyler1d
Collaborator Author

schuyler1d commented Dec 24, 2019

@ibrand please review these instructions; they would also go wherever we want them for the RELEASE_NOTES.

Instructions for migrating large database instances (>1 million messages/contacts)

  1. Make sure SUPPRESS_MIGRATIONS=1 in your environment
  2. Before deploying the code, you will want to do most of these changes manually beforehand, ideally while the system is down. If you are planning to do these migrations before deploying the new code, we strongly recommend merging this PR to your production instance, separate from the schema changes, BEFORE these steps. It removes a few bugs which caused Postgres to deadlock on message table updates. The instructions below assume Postgres.

2.1 Create the additional columns:

ALTER TABLE message ADD COLUMN campaign_contact_id integer NULL REFERENCES campaign_contact(id);
ALTER TABLE message ADD COLUMN messageservice_sid text NULL;
ALTER TABLE message ALTER COLUMN assignment_id DROP NOT NULL;

These should be non-blocking and 'quick' -- i.e. downtime is probably unnecessary.

2.2 Once those complete, you will want to create the campaign_contact_id INDEX:

CREATE INDEX CONCURRENTLY message_campaign_contact_id_index ON message (campaign_contact_id);

The CONCURRENTLY keyword should make this possible without downtime, but sometimes caution is best.

2.3 (LOCKING: Best with system unstressed/down) Next you want to start filling in the campaign_contact_id. You may want to prioritize live or recent campaigns with additional qualifications or do this in batches with a command similar to:

UPDATE message
SET campaign_contact_id = campaign_contact.id
FROM campaign_contact
WHERE message.assignment_id = campaign_contact.assignment_id
  AND message.contact_number = campaign_contact.cell
  AND message.id IN (
    SELECT id
    FROM message
    WHERE campaign_contact_id IS NULL
    LIMIT 1000000
  );

Note the "LIMIT 100000" is doing it in batches of 1 million. Another strategy would be to find the lowest campaign_contact_id value for your live campaigns and add a WHERE campaign_contact.id > XXX where XXX is that value (or do the same thing for message.id > XXX).

This WILL LOCK the MESSAGE table -- and thus stop processing of events, so you will want to do this off-hours, ideally with the system down.

Once you have done this with the majority of messages, especially messages for live campaigns, you should now be ready for real downtime.

2.4 (LOCKING: Requires downtime) For the final steps, definitely take the system down if you haven't already. Complete the updates to messageservice_sid and campaign_contact_id; then run these final statements:

CREATE INDEX CONCURRENTLY cell_messageservice_sid_idx ON message (contact_number, messageservice_sid);
DROP INDEX message_contact_number_index;
INSERT INTO knex_migrations (name, batch, migration_time) VALUES
  ('20191217125726_add_message_campaign_contact_id.js', 3, now()),
  ('20191217130355_change_message_indexes.js', 3, now()),
  ('20191217130000_message_migrate_data.js', 3, now());
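
The batch value (3 here) depends on your own knex_migrations history; if in doubt, you can check the current highest batch first:

SELECT max(batch) FROM knex_migrations;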

2.5 Deploy the new code!
2.6 As soon as possible (after the system is working), run:

DROP INDEX message_assignment_id_index;

NOTES:

  • One of the main challenges, besides migrating and indexing a large table, is that having many indexes on the message table, with so many inserts, can tax the insertions too much and slow the system down. So while we are indexing assignment_id AND contact_number AND contact_number+messageservice_sid AND campaign_contact_id all at once, the system is likely to be over-stressed. This is why we stagger the index creation and then drop the unnecessary indexes as soon as we have the system up again. Obviously, if ALL these steps are done during a single downtime event, that would work too -- but staggering them can allow shorter, stepped periods of downtime or migration.
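
If it helps while staggering these steps, a standard Postgres catalog query (not part of this PR) shows which indexes currently exist on message:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'message';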

@ibrand
Collaborator

ibrand commented Dec 26, 2019

@schuyler1d Thank you for writing up such thorough instructions! I'm especially into the notes around what locks the tables and what doesn't. One thing that might be helpful to add is to give ballpark time estimates. For example:
2.1 (should only take around an hour for large systems)
2.3 (likely to take overnight for large systems)

ibrand previously approved these changes Dec 26, 2019
@ibrand ibrand temporarily deployed to spoke-review-db-message-db1x3j December 26, 2019 18:47
@ibrand ibrand added the S-ready for stage-main (qa) label and removed the S-needs tests to pass label Dec 26, 2019
@ibrand ibrand mentioned this pull request Dec 27, 2019
@schuyler1d schuyler1d added the S-qa staging round label Jan 2, 2020
@ibrand ibrand temporarily deployed to spoke-review-db-message-db1x3j January 6, 2020 21:51
@ibrand ibrand temporarily deployed to spoke-review-db-message-db1x3j January 13, 2020 18:35
@schuyler1d schuyler1d mentioned this pull request Jan 13, 2020
@ibrand ibrand merged commit bd03a35 into main Jan 14, 2020