feat(pacer): Refine multi-document page handling logic #402

Merged
elisa-a-v merged 19 commits into main from 349-feat-identify-multidoc-pages-with-one-doc
Dec 17, 2024

Conversation

ERosendo
Contributor

@ERosendo ERosendo commented Sep 30, 2024

Key changes:

  • Refines the handleCombinedPdfPageView (appellate) and handleCombinedPDFView (district) methods to accurately identify multi-document pages containing only one PDF file. By analyzing the HTML structure, I noticed that receipt tables are enclosed within center divs, and the number of these divs corresponds to the number of files in the combined PDF. Both methods now check for the presence of center nodes to determine if a warning should be displayed.

    In appellate pages, an additional filter was implemented to ensure accurate counting, as center divs may also be used to wrap the page's main content.

  • In both district and appellate courts, the document ID is often not directly accessible within the HTML structure of the page. While some courts use the document ID as the entry number, this is not a consistent practice across all jurisdictions. To address this challenge, this PR introduces two helper methods that use the URL of the PACER page and the existing DocToCases mapping stored in our local storage:

    • District court URLs frequently contain a query parameter named exclude_attachments. This parameter is a comma-separated list of shortened document IDs that are not included in the combined PDF. By parsing this list and comparing it to the DocToCases mapping, we can identify the missing document ID.

      This PR introduces the getPacerDocIdFromExcludeList helper function. It takes a list of excluded document IDs as input and returns the corresponding document ID based on the DocToCases mapping.

    • Appellate court URLs often include a query parameter named dls. This parameter is a comma-separated list of shortened document IDs that are included in the combined PDF. By filtering the DocToCases mapping based on this list, we can determine the document ID.

      The getPacerDocIdFromPartialId method implements this filtering process, taking the partial ID as input and returning the matching document ID.

  • Introduces a new utility function, parseDataFromReceiptTable, to extract data from receipt tables in appellate courts. While parsing the title alone is often enough for single-document pages, it lacks the necessary information to identify the document in multi-document pages. To address this limitation, this function extracts data directly from the receipt table, providing a more reliable and comprehensive approach.

  • Integrates all helper functions into the handleCombinedPdfPageView (appellate) and handleCombinedPDFView (district) methods. This enables us to insert banners for available documents and upload the PDFs to the RECAP archive.
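
As a rough illustration of the ideas in the list above (not the code in this PR), the sketch below counts receipt wrappers and resolves a document ID from the URL parameters. The function names, the shape of the mapping (full document ID to case ID), the "contains a table, has no nested center" filter, and the assumption that a shortened ID is a suffix of the full ID are all hypothetical.

```javascript
// Hypothetical sketch; names and data shapes are assumptions for illustration.

// Count <center> nodes that wrap a receipt table. Assumed filter: skip nodes
// without a table and outer wrappers that contain other <center> nodes.
// `root` is anything exposing querySelectorAll, e.g. `document` in a browser.
function countReceiptWrappers(root) {
  return Array.from(root.querySelectorAll('center')).filter(
    (node) =>
      node.querySelector('table') !== null &&
      node.querySelector('center') === null
  ).length;
}

// Assumption: a shortened ID is a suffix of the full PACER document ID.
function matchesShortId(fullId, shortId) {
  return shortId.length > 0 && fullId.endsWith(shortId);
}

// Split a comma-separated query parameter value into non-empty IDs.
function parseIdList(paramValue) {
  return paramValue
    .split(',')
    .map((id) => id.trim())
    .filter(Boolean);
}

// District: the combined PDF holds the documents of the case that are NOT
// listed in exclude_attachments. Returns the ID only when exactly one remains.
function findDocIdNotExcluded(docsToCases, caseId, excludeParam) {
  const excluded = parseIdList(excludeParam);
  const remaining = Object.keys(docsToCases).filter(
    (docId) =>
      docsToCases[docId] === caseId &&
      !excluded.some((shortId) => matchesShortId(docId, shortId))
  );
  return remaining.length === 1 ? remaining[0] : null;
}

// Appellate: the dls parameter lists the documents that ARE included.
// Returns the ID only when exactly one known document matches.
function findDocIdIncluded(docsToCases, dlsParam) {
  const included = parseIdList(dlsParam);
  const matches = Object.keys(docsToCases).filter((docId) =>
    included.some((shortId) => matchesShortId(docId, shortId))
  );
  return matches.length === 1 ? matches[0] : null;
}
```

In a content script, the parameter value itself could be read with `new URL(window.location.href).searchParams.get('exclude_attachments')`. Returning `null` when zero or several candidates remain keeps an ambiguous page from being matched to the wrong document.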

Here are GIFs showing how our extension works in appellate and district courts:

  • District Court:

Screen Recording 2024-10-01 at 4 48 23 PM

  • Appellate Court:

Screen Recording 2024-10-01 at 3 52 34 PM

Fixes freelawproject/recap#349

@ERosendo ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 7 times, most recently from c562279 to 6ba914a Compare October 1, 2024 11:58
@ERosendo ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 5 times, most recently from 8e2c8ed to e679a22 Compare October 1, 2024 19:54
This commit introduces a helper function that encapsulates logic to check if a specific document within a combined PDF page is available in the RECAP archive.
Ensures that the `docsToCases` mapping is correctly populated when processing attachment pages.
Adds a new utility function to retrieve the `DocToCases` mapping from storage.
Introduces a new function to determine if a particular document within a multi-doc page is available in the RECAP archive.
This commit introduces a new utility function to efficiently extract data from receipt tables, addressing the limitation of multi-document pages. This enhancement improves the extension's ability to accurately process documents.
@ERosendo ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch from e679a22 to dbb4b31 Compare October 1, 2024 22:54
@ERosendo ERosendo marked this pull request as ready for review October 1, 2024 22:54
@ERosendo ERosendo requested a review from mlissner October 1, 2024 22:55
@ERosendo
Contributor Author

ERosendo commented Oct 1, 2024

@mlissner in my last commit, I implemented MIME type validation to prevent the upload of invalid file formats. During testing, I encountered an issue with certain district courts, such as case 2:24-mj-00100, where downloading a single document from a multi-document page seemed restricted. In both Chrome and Firefox, with and without extensions, I consistently received the error message: `Cannot redisplay /tmp/1727589-2--109361.pdf, it has already been shown once.` While a court's tips and tricks page suggests it might be a Chrome-related issue, my testing indicated that the error was not browser-specific.

Upon further investigation, I discovered that the extension was sending the HTML page containing the error message to the CL API (not great). By implementing the validation, we can prevent the upload of the invalid HTML content.
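
A minimal sketch of that kind of check, assuming the download is available as a content type string plus its raw bytes. The helper name is hypothetical, and the extra signature check on the first bytes is an addition for illustration, not something this PR is stated to do.

```javascript
// Hypothetical helper: returns true only when a download looks like a PDF.
function looksLikePdf(contentType, bytes) {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mime = (contentType || '').split(';')[0].trim().toLowerCase();
  if (mime !== 'application/pdf') return false;

  // PDF files start with the ASCII signature "%PDF-".
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d];
  return signature.every((byte, index) => bytes[index] === byte);
}
```

With a check like this in front of the upload call, an HTML error page served with `text/html` is rejected before anything is sent to the API.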

Here are GIFs showing the error message in different browsers:

  • Chrome:

Screen Recording 2024-10-01 at 7 10 16 PM

  • Firefox:

Screen Recording 2024-10-01 at 7 11 44 PM

  • Safari:

Screen Recording 2024-10-01 at 7 13 51 PM

Contributor

@elisa-a-v elisa-a-v left a comment

The descriptive names for variables and methods, as well as the comprehensive comments explaining things are very much appreciated and quite helpful, thank you! In general the code LGTM with just a couple of minor things related to duplicated code.

While testing I did notice a few bugs:

  1. This docket's fourth entry has two attachments, with only the second one available in RECAP. When navigating to PACER to buy the first attachment, the extension displayed the banner indicating that it was already available in RECAP, but the link was actually for the second attachment, and the document displayed had two different headers superimposed on each other.
    Screenshot from 2024-12-09 18-50-20

  2. Similar to the previous bug, after buying attachment 5 in the first doc in this docket, I tried to buy attachment 6, and I saw the banner indicating the document was available. When I tried to buy it anyway, the PACER site displayed attachment 5 instead of 6, and nothing was uploaded to RECAP.

  3. When buying doc number 1208679954 attachment 2 for this docket (appellate), I could see the document in PACER and the extension displayed the notification that the document was successfully uploaded to RECAP, but the document was not available and the processing queue displayed an error. The document was finally successfully uploaded by a later processing queue (actually two PQs were successful: this and this) and is now available in RECAP.

  4. When buying attachment 1 in the same doc and same docket as the previous bug, it first uploaded it successfully but without an attachment number, so the RECAP page displayed it as a main document. A few minutes later upon refreshing the docket page, the document was gone again. I think this was the processing queue that uploaded it, which has a successful status but has no RECAP document.
    Screenshot from 2024-12-10 17-46-54
    Screenshot from 2024-12-10 17-47-03

  5. This one might be PACER's fault, but on my first try doing anything with this docket, I tried buying attachment 2 in doc number 1, and got an error message Cannot redisplay /tmp/9213722-2--117083.pdf, it has already been shown once. The document is not available in RECAP and I cannot buy it from PACER. Same with the rest of the 1-page attachments. yeah you already described this in your last comment I'm sorry I missed that!! 🤦🏽 I am not getting any errors from the extension with this issue.

Considering this PR already introduces a fair amount of changes, I'm not sure numbers 3 and 4 should be addressed here, but I'll leave that up to ye @mlissner and @ERosendo

src/content_delegate.js (outdated, resolved)
src/appellate/appellate.js (outdated, resolved)
@elisa-a-v elisa-a-v self-requested a review December 10, 2024 22:22
Contributor

@elisa-a-v elisa-a-v left a comment

Oops I accidentally sent the last review before I checked the request changes option, so I'm requesting changes now 😅

@mlissner
Member

My vote is that if the bugs you found aren't coming from this code, they should get filed and analyzed elsewhere, and we should get this merged now. If they're part of this code, then we should make sure we address them (or choose not to).

@mlissner mlissner assigned ERosendo and unassigned elisa-a-v Dec 11, 2024
@elisa-a-v
Contributor

elisa-a-v commented Dec 11, 2024

> if the bugs you found aren't coming from this code, they should get filed and analyzed elsewhere

That makes sense to me, but the behavior cannot be reproduced in the live extension: they are all attachments in appellate courts, so we always get the extension banner warning that the doc cannot be uploaded. I'm therefore not entirely sure whether this has to do with the extension or not.
image

I understand we have identified other issues regarding appellate courts, I wonder if these are more closely related to those issues rather than this one?

I think at least 3 and 4 are probably a CL issue with the processing queues and/or related tasks, more so than the extension itself, don't you think?

@mlissner
Member

I'm going to duck out of having an opinion on these, but I trust that you and Eduardo can puzzle these out. Sorry! These are always tricky issues. :)

@ERosendo
Contributor Author

ERosendo commented Dec 11, 2024

@elisa-a-v Thanks for the thorough review!! I have addressed your suggestions by refactoring the duplicated code and fixing bugs 1 and 4. Unfortunately, I couldn't reproduce bugs 2 and 3.

Bug 1 was caused by a slight difference in how ACMS identifies documents compared to CMECF district and appellate courts. My initial approach used the same logic for all three, but commit d1622c0 adds an extra step to ensure the banner is displayed correctly for ACMS documents.

On the other hand, bug 4 occurred because the extension didn't send the attachment number during the document upload process. I believe CourtListener removed your upload after it processed an attachment page upload for that entry. Entries with attachments from appellate courts don't have main documents, so the merger code updated them accordingly. Commit 864e4cd introduces a new mapping similar to docsToCases, allowing us to retrieve the attachment number from the download page when it's missing from the HTML code.
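
To make the idea concrete, here is a small sketch of such a lookup, assuming attachment numbers are recorded while the attachment page is processed and read back on the download page. The `Map` stands in for the extension's storage, and every name below is hypothetical rather than taken from commit 864e4cd.

```javascript
// Hypothetical in-memory stand-in for the extension's stored mapping
// (document ID -> attachment number).
const docsToAttachmentNumbers = new Map();

// While parsing an attachment page: remember each document's attachment number.
function rememberAttachmentNumbers(attachments) {
  for (const { docId, attachmentNumber } of attachments) {
    docsToAttachmentNumbers.set(docId, attachmentNumber);
  }
}

// On the download page: prefer the number found in the HTML, otherwise fall
// back to the stored mapping. Returns null when neither source knows it.
function resolveAttachmentNumber(docId, numberFromHtml) {
  if (numberFromHtml !== null && numberFromHtml !== undefined) {
    return numberFromHtml;
  }
  return docsToAttachmentNumbers.has(docId)
    ? docsToAttachmentNumbers.get(docId)
    : null;
}
```

Falling back to `null` instead of a default such as 0 lets the caller tell "unknown" apart from a real attachment number before uploading.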

@ERosendo ERosendo requested a review from elisa-a-v December 11, 2024 21:56
@ERosendo ERosendo assigned elisa-a-v and unassigned ERosendo Dec 11, 2024
Contributor

@elisa-a-v elisa-a-v left a comment

LGTM! :shipit:

@mlissner
Member

There it is!

@mlissner
Member

Mash that merge button, Elisa!

@elisa-a-v elisa-a-v merged commit c218c7d into main Dec 17, 2024
8 checks passed
@elisa-a-v elisa-a-v deleted the 349-feat-identify-multidoc-pages-with-one-doc branch December 17, 2024 23:05
@elisa-a-v
Contributor

elisa-a-v commented Dec 17, 2024

Oh no! Is this a known issue? (the failed job)
image

@mlissner
Member

Don't think so. Guess we better file it and figure it out.

@elisa-a-v
Contributor

I re-ran the failed job and it was now successful. Should we still look into it?

@mlissner
Member

Let's see if it fails a few more times first, thanks.

Successfully merging this pull request may close these issues.

Incorrectly identified split pages
3 participants