pack currently has a timeout for knowing when to pack a DB.
However, I want to use that as the indicator for when an out-of-band full crawl is complete. That is, if we grant a hall pass for a full crawl, we know we are done when we pack the database.
(There is no way of knowing when we are "done." A domain could have one page, and then next time, 10000 pages. Without knowing the full scope of a site, we can't guess when we are done. And, even then... it would be a guess.)
So, pack should behave this way:
If we receive a timer-reset message (which is what extract sends):
Start a timer for that host.
Every time we receive a timer-reset message from extract, pack resets the host timer.
When the timer fires, remove the timer from local state, and put a do-the-packing message on the queue.
Now, if we receive a do-the-packing message:
Check if there is a host timer.
If there is, ignore the do-the-packing message. This means another timer-reset came in while the packing message was making its way around the queue.
If there is not, then pack the DB.
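The two handlers above can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the service's implementation: `PackSketch`, `tick`, the `timeout` value, and the in-memory `queue` list are all hypothetical names, and time is passed in explicitly so the behavior is easy to follow without real timers.

```python
from dataclasses import dataclass, field


@dataclass
class PackSketch:
    """Hypothetical stand-in for pack; only the timer bookkeeping is modeled."""

    timeout: float                                  # assumed quiet period, in seconds
    deadlines: dict = field(default_factory=dict)   # host -> time the timer fires
    queue: list = field(default_factory=list)       # stand-in for the shared queue
    packed: list = field(default_factory=list)      # hosts whose DB was "packed"

    def on_timer_reset(self, host, now):
        # Start the host timer, or push it back if it is already running.
        self.deadlines[host] = now + self.timeout

    def tick(self, now):
        # Fire every expired timer: drop it from local state, then enqueue.
        for host in [h for h, t in self.deadlines.items() if t <= now]:
            del self.deadlines[host]
            self.queue.append(("do-the-packing", host))

    def on_do_the_packing(self, host):
        if host in self.deadlines:
            return False          # a timer-reset arrived in the meantime; ignore
        self.packed.append(host)  # placeholder for actually packing the DB
        return True


pack = PackSketch(timeout=10)
pack.on_timer_reset("example.gov", now=0)
pack.on_timer_reset("example.gov", now=5)    # timer now fires at 15, not 10
pack.tick(now=15)                            # timer fires; message is enqueued
pack.on_timer_reset("example.gov", now=16)   # a late page arrives from extract
print(pack.on_do_the_packing("example.gov"))  # False: a host timer exists again
```

The only state held is the per-host deadline, which matches the goal of keeping pack nearly stateless.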
Later, an improvement could be that the do-the-packing message says when to do it. However, this would involve a requeue loop on those messages, and we might want to think about how often pack checks the queue. That could interfere with how often we process timer-reset messages... so, for now, we'll stick with this model.
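For comparison, the deferred variant might look like the following. This is a rough sketch only: the `(host, not_before)` tuple shape and the `poll_once` function are assumptions for illustration, and the queue is a plain in-memory `deque`. It shows the requeue loop mentioned above: every poll touches each early message again, which is why the polling frequency would matter.

```python
from collections import deque


def poll_once(queue, now, pack_db):
    """Make one pass over the queue; requeue packing messages that are early.

    Each message is a hypothetical (host, not_before) tuple.
    Returns how many messages were put back.
    """
    requeued = 0
    for _ in range(len(queue)):
        host, not_before = queue.popleft()
        if now < not_before:
            queue.append((host, not_before))  # too early: back of the queue
            requeued += 1
        else:
            pack_db(host)                     # placeholder for the real packing
    return requeued
```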
This introduces a new message type on the queue, but it means that there is minimal state in pack. If we reset mid-crawl... well. It means one of two things:
If we reset, but there are more URLs being processed, pack will receive more timer-reset messages. This is good. It will restart the sequence.
If we reset, and receive a do-the-packing message, we'll pack what exists.
It should be impossible to reset in a way that loses a do-the-packing message. That is, the job will not be marked as complete until we 1) pack the DB, 2) upload it to S3, and 3) successfully return, signaling a completed job. So, the message to pack will stay on the queue even if pack crashes part-way through packing a DB.
This also means we can enqueue, via admin, a message to pack a database immediately (by issuing a do-the-packing message for a host).
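The crash-safety argument can be illustrated with a toy queue. This is a sketch under assumptions, not the real queue's API: `QueueSketch`, `work_one`, and `make_handler` are hypothetical, and the pack and upload steps are placeholders. The point is that a job is removed only after its handler returns, so a crash mid-pack leaves the message in place, and an admin can trigger packing with the same `enqueue` call.

```python
class QueueSketch:
    """Hypothetical at-least-once queue: a job leaves only after its handler returns."""

    def __init__(self):
        self.jobs = []

    def enqueue(self, kind, host):
        self.jobs.append((kind, host))

    def work_one(self, handler):
        if not self.jobs:
            return False
        job = self.jobs[0]      # peek; do not remove yet
        try:
            handler(*job)
        except Exception:
            return False        # handler crashed: job stays queued for a retry
        self.jobs.pop(0)        # success: the job is marked complete
        return True


def make_handler():
    """Build a handler that crashes on its first attempt (simulated)."""
    state = {"attempts": 0, "uploaded": []}

    def handler(kind, host):
        state["attempts"] += 1
        # 1) pack the DB (placeholder), then 2) upload it (placeholder).
        if state["attempts"] == 1:
            raise RuntimeError("simulated crash part-way through packing")
        state["uploaded"].append(host)
        # 3) returning normally signals a completed job.

    return handler, state


queue = QueueSketch()
queue.enqueue("do-the-packing", "example.gov")  # what admin would issue
handler, state = make_handler()
queue.work_one(handler)   # first attempt crashes; the message is still queued
queue.work_one(handler)   # retry succeeds; the message is removed
```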
Process checklist
Has a clear story statement
Can reasonably be done in a few days (otherwise, split this up!)
If there's UI...
Screen reader - Listen to the experience with a screen reader extension, ensure the information presented in order
Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.
At a glance
In order to keep state out of the services, as a developer, I want pack to requeue things when it is before their time.
Acceptance Criteria
We use DRY behavior-driven development wherever possible.
then...
Shepherd
Security Considerations
Required per CM-4.
None. This is internal work about the queue, and has no external surface/concerns. Nor is there a particular attack vector that is introduced that is exceptional given the architecture of the application as a whole.