👜 rework pack to leverage the queue better #30

Open
1 task
jadudm opened this issue Nov 20, 2024 · 0 comments

jadudm (Collaborator) commented Nov 20, 2024

At a glance

In order to keep state out of the services
as a developer
I want pack to requeue messages that arrive before it is time to act on them.

Acceptance Criteria

We use DRY behavior-driven development wherever possible.

then...

Shepherd

  • UX shepherd:
  • Design shepherd:
  • Engineering shepherd:

Background

pack currently has a timeout for knowing when to pack a DB.

However, I want to use that as the indicator for when an out-of-band full crawl is complete. That is, if we grant a hall pass for a full crawl, we know we are done when we pack the database.

(There is no way of knowing when we are "done." A domain could have one page, and then next time, 10000 pages. Without knowing the full scope of a site, we can't guess when we are done. And, even then... it would be a guess.)

So, pack should behave this way:

If we receive a timer-reset message (which is what extract sends):

  1. Start a timer for that host.
  2. Every time we receive a timer-reset message from extract, pack resets the host timer.
  3. When the timer fires, remove the timer from local state, and put a do-the-packing message on the queue.
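
The timer-reset steps above could be sketched roughly as follows. This is only an illustration under assumptions: `HostTimers`, `on_timer_reset`, `has_timer`, the `enqueue` callable, and the message shape are all hypothetical names, and `threading.Timer` stands in for whatever timer mechanism pack actually uses.

```python
import threading


class HostTimers:
    """Per-host timers; the only state pack would hold (hypothetical helper)."""

    def __init__(self, timeout_s, enqueue):
        self.timeout_s = timeout_s
        self.enqueue = enqueue  # callable(message); stand-in for the real queue client
        self._timers = {}       # host -> (token, threading.Timer)
        self._lock = threading.Lock()

    def on_timer_reset(self, host):
        """Start the host timer, or restart it if one is already running."""
        with self._lock:
            old = self._timers.pop(host, None)
            if old is not None:
                old[1].cancel()
            token = object()  # identifies this particular timer
            timer = threading.Timer(self.timeout_s, self._fire, args=(host, token))
            timer.daemon = True
            self._timers[host] = (token, timer)
            timer.start()

    def _fire(self, host, token):
        with self._lock:
            current = self._timers.get(host)
            if current is None or current[0] is not token:
                return  # superseded by a newer timer-reset; do nothing
            del self._timers[host]  # remove the timer from local state
        # Assumed message shape; the real queue payload may differ.
        self.enqueue({"type": "do-the-packing", "host": host})

    def has_timer(self, host):
        with self._lock:
            return host in self._timers
```

The token check guards against a timer that fires at the same moment a new timer-reset replaces it; without it, the stale timer could delete its successor.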

Now, if we receive a do-the-packing message:

  1. Check if there is a host timer.
    1. If there is, ignore the do-the-packing message. This means another timer-reset came in while the packing message was making its way around the queue.
    2. If there is not, then pack the DB.
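
A minimal sketch of that decision, assuming pack can ask "is there a live timer for this host?" The function name, the `active_timers` set, and the `pack_db` callable are hypothetical and only illustrate the branch.

```python
def handle_do_the_packing(host, active_timers, pack_db):
    """Pack the host's DB unless a timer for that host is still running.

    active_timers: set of hosts that currently have a timer (assumed shape).
    pack_db: callable(host) that does the real packing (hypothetical).
    Returns "ignored" or "packed".
    """
    if host in active_timers:
        # Another timer-reset came in while this message was on the queue.
        # The running timer will enqueue a fresh do-the-packing message.
        return "ignored"
    pack_db(host)
    return "packed"
```

Ignoring the message is safe here because the live timer guarantees that a later do-the-packing message will follow.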

Later, an improvement could be that the do-the-packing message says when to do it. However, this would involve a requeue loop on those messages, and we might want to think about how often pack checks the queue. That could interrupt how often we process timer-reset messages... so, for now, we'll stick with this model.
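
For reference, the deferred variant might look something like the sketch below. It is not proposed for this issue. The `not_before` field, the `drain_once` function, and the in-memory `deque` standing in for the queue are all assumptions made for illustration.

```python
from collections import deque


def drain_once(queue, now, pack_db):
    """Look at each message currently on the queue exactly once.

    Messages are dicts with "host" and "not_before" (epoch seconds);
    both field names are assumptions for this sketch.
    """
    for _ in range(len(queue)):
        msg = queue.popleft()
        if now < msg["not_before"]:
            queue.append(msg)  # before its time: put it back untouched
        else:
            pack_db(msg["host"])
```

Each pass touches every waiting message, which is the requeue loop mentioned above: the more often pack polls, the more churn those messages create.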

This introduces a new message type on the queue, but it means that there is minimal state in pack. If we reset mid-crawl... well. It means one of two things:

  1. If we reset, but there are more URLs being processed, pack will receive more timer-reset messages. This is good. It will restart the sequence.
  2. If we reset, and receive a do-the-packing message, we'll pack what exists.

It should be impossible to reset in a way that leaves a do-the-packing message half-processed. The job is not marked complete until we 1) pack the DB, 2) upload it to S3, and 3) return successfully, signaling a completed job. So the message to pack stays on the queue even if pack crashes partway through packing a DB.
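
The "complete only after success" ordering can be illustrated with a toy in-memory queue. This is a sketch under assumptions: `work_one`, `pack_db`, and `upload_to_s3` are hypothetical names, and a real queue would lease or lock the job rather than peek at a `deque`.

```python
from collections import deque


def work_one(queue, pack_db, upload_to_s3):
    """Work the message at the head of the queue.

    Returns True when the job completed, False when the queue was empty.
    Any exception from pack_db or upload_to_s3 propagates and leaves the
    message where it was.
    """
    if not queue:
        return False
    msg = queue[0]               # peek: the message stays queued while we work
    path = pack_db(msg["host"])  # 1) pack the DB
    upload_to_s3(path)           # 2) upload it
    queue.popleft()              # 3) only now is the job marked complete
    return True
```

If pack crashes in step 1 or 2, the message is still at the head of the queue when pack comes back, so the work is retried rather than lost.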

This also means we can enqueue, via admin, a message to pack a database immediately (by issuing a do-the-packing message for a host).

Security Considerations

Required per CM-4.

None. This is internal work about the queue, and has no external surface/concerns. Nor is there a particular attack vector that is introduced that is exceptional given the architecture of the application as a whole.


Process checklist
  • Has a clear story statement
  • Can reasonably be done in a few days (otherwise, split this up!)
  • Shepherds have been identified
  • UX youexes all the things
  • Design designs all the things
  • Engineering engineers all the things
  • Meets acceptance criteria
  • Meets QASP conditions
  • Presented in a review
  • Includes screenshots or references to artifacts
  • Tagged with the sprint where it was finished
  • Archived

If there's UI...

  • Screen reader - Listen to the experience with a screen reader extension, ensure the information is presented in order.
  • Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
  • Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.
@jadudm jadudm added this to jemison Nov 20, 2024
@github-project-automation github-project-automation bot moved this to triage in jemison Nov 20, 2024
@jadudm jadudm moved this from triage to backlog in jemison Nov 20, 2024
@jadudm jadudm changed the title 🏗️ rework pack to leverage the queue better 👜 rework pack to leverage the queue better Nov 20, 2024
@jadudm jadudm moved this from backlog to triage in jemison Nov 23, 2024