-
-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(pacer): Refine multi-document page handling logic #402
feat(pacer): Refine multi-document page handling logic #402
Conversation
c562279
to
6ba914a
Compare
8e2c8ed
to
e679a22
Compare
This commit introduces a helper function that encasuplates logic to check if a specific document within a combined PDF page is available in the recap archive.
Ensures that the `docsToCases` mapping is correctly populated when processing attachment pages.
Adds a new utility function to retrieve the `DocToCases` mapping from storage
Introduces a new function to determine if a particular document within a multi-doc page is available in the recap archive.
This commit introduces a new utility function to efficiently extract data from receipt tables, addressing the limitation of multi-document pages. This enhancement improves the extension's ability to accurately process documents.
e679a22
to
dbb4b31
Compare
@mlissner in my last commit, I implemented a Upon further investigation, I discovered that the extension was sending the HTML page containing the error message to the CL API (not great). By implementing the validation, we can prevent the upload of the invalid HTML content. Here are gifs showing the error message in different browsers:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The descriptive names for variables and methods, as well as the comprehensive comments explaining things are very much appreciated and quite helpful, thank you! In general the code LGTM with just a couple of minor things related to duplicated code.
While testing I did notice a few bugs:
-
This docket's fourth entry has two attachments, with only the second one available in RECAP. When navigating to PACER to buy the first attachment, the extension displayed the banner indicating that it was already available in RECAP, but the link was actually for the second attachment and the document displayed had two superposed different headers.
-
Similar to the previous bug, after buying attachment 5 in the first doc in this docket, I tried to buy attachment 6, and I saw the banner indicating the document was available. When I tried to buy it anyway, the PACER site displayed attachment 5 instead of 6, and nothing was uploaded to RECAP.
-
When buying doc number 1208679954 attachment 2 for this docket (appellate), I could see the document in PACER and the extension displayed the notification that the document was successfully uploaded to RECAP, but the document was not available and the processing queue displayed an error. The document was finally successfully uploaded by a later processing queue (actually two PQs were successful: this and this) and is now available in RECAP.
-
When buying attachment 1 in the same doc and same docket as the previous bug, it first uploaded it successfully but without an attachment number, so the RECAP page displayed it as a main document. A few minutes later upon refreshing the docket page, the document was gone again. I think this was the processing queue that uploaded it, which has a successful status but has no RECAP document.
-
This one might be PACER's fault, but on my first try doing anything with this docket, I tried buying attachment 2 in doc number 1, and got an error messageyeah you already described this in your last comment I'm sorry I missed that!! 🤦🏽 I am not getting any errors from the extension with this issue.Cannot redisplay /tmp/9213722-2--117083.pdf, it has already been shown once.
The document is not available in RECAP and I cannot buy it from PACER. Same with the rest of the 1-page attachments.
Considering this PR already introduces a fair amount of changes, I'm not sure numbers 3 and 4 should be addressed here, but I'll leave that up to ye @mlissner and @ERosendo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops I accidentally sent the last review before I checked the request changes option, so I'm requesting changes now 😅
My vote is that if the bugs you found aren't coming from this code, they should get filed and analyzed elsewhere, and we should get this merged now. If they're part of this code, then we should make sure we address them (or choose not to). |
I'm going to duck out of having an opinion on these, but I trust that you and Eduardo can puzzle these out. Sorry! These are always tricky issues. :) |
@elisa-a-v Thanks for the thorough review!! I have addressed your suggestions by refactoring the duplicated code and fixing bugs 1 and 4. Unfortunately, I couldn't reproduce bugs 2 and 3. Bug 1 was caused by a slight difference in how ACMS identifies documents compared to CMECF district and appellate courts. My initial approach used the same logic for all three, but commit d1622c0 adds an extra step to ensure the banner is displayed correctly for ACMS documents. On the other hand, bug 4 occurred because the extension didn't send the attachment number during the document upload process. I believe CourtListener removed your upload after it processed an attachment page upload for that entry. Entries with attachments from appellate courts don't have main documents, so the merger code updated them accordingly. Commit 864e4cd introduces a new mapping similar to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There it is! |
Mash that merge button, Elisa! |
Oh no! Is this a known issue? (the failed job) |
Don't think so. Guess we better file it and figure it out. |
I re-ran the failed job and it was now successful. Should we still look into it? |
Let's see if it fails a few more times first, thanks. |
Key changes:
Refines the
handleCombinedPdfPageView
(appellate) andhandleCombinedPDFView
(district) methods to accurately identify multi-document pages containing only one PDF file. By analyzing the HTML structure, I noticed that receipt tables are enclosed within center divs, and the number of these divs corresponds to the number of files in the combined PDF. Both methods now check for the presence of center nodes to determine if a warning should be displayed.In appellate pages, an additional filter was implemented to ensure accurate counting, as center divs may also be used to wrap the page's main content.
In both district and appellate courts, the document ID is often not directly accessible within the HTML structure of the page. While some courts use the document ID as the entry number, this is not a consistent practice across all jurisdictions. To address this challenge, this PR introduces two helper methods that uses the URL of the PACER page and the existing
DocToCases
mapping stored in our local storage:District court URLs frequently contain a query parameter named
exclude_attachments
. This parameter is a comma-separated list of shortened document IDs that are not included in the combined PDF. By parsing this list and comparing it to the DocToCases mapping, we can identify the missing document ID.This PR introduces the
getPacerDocIdFromExcludeList
helper function. It takes a list of excluded document IDs as input and returns the corresponding document ID based on the DocToCases mapping.Appellate court URLs often include a query parameter named
dls
. This parameter is a comma-separated list of shortened document IDs that are included in the combined PDF. By filtering the DocToCases mapping based on this list, we can determine the document ID.The
getPacerDocIdFromPartialId
method implements this filtering process, taking the partial as input and returning the extracted document ID.Introduces a new utility function,
parseDataFromReceiptTable
, to extract data from receipt tables in appellate courts. While parsing the title alone is often enough for single-document pages, it lacks the necessary information to identify the document in multi-document pages. To address this limitation, this function extracts data directly from the receipt table, providing a more reliable and comprehensive approach.Integrate all helper functions into the
handleCombinedPdfPageView
(appellate) andhandleCombinedPDFView
(district) methods. This will enable us to insert banners for available documents and upload the PDFs to the recap archive.Here are GIFs showing how our extension works in appellate and district courts:
Fixes freelawproject/recap#349