feat(pacer): Refine multi-document page handling logic #402

ERosendo · 2024-09-30T14:52:05Z

Key changes:

Refines the handleCombinedPdfPageView (appellate) and handleCombinedPDFView (district) methods to accurately identify multi-document pages containing only one PDF file. By analyzing the HTML structure, I noticed that receipt tables are enclosed within center divs, and the number of these divs corresponds to the number of files in the combined PDF. Both methods now check for the presence of center nodes to determine if a warning should be displayed.

In appellate pages, an additional filter was implemented to ensure accurate counting, as center divs may also be used to wrap the page's main content.
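
The counting idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `countReceiptBlocks` and `isSingleDocumentPage` are hypothetical names, and the rule used to skip a `center` node that wraps the page's main content (it contains another `center`) is an assumption standing in for the appellate filter.

```javascript
// Count the <center> nodes that look like receipt wrappers.
// Assumption: a receipt wrapper contains a <table> and no nested <center>,
// so a <center> that wraps the whole page content is skipped by this test.
function countReceiptBlocks(root) {
  return Array.from(root.querySelectorAll('center')).filter(
    (node) =>
      node.querySelector('table') !== null &&
      node.querySelector('center') === null
  ).length;
}

// A combined-PDF page with exactly one receipt block is treated as a
// single-document page, so the multi-document warning would not be needed.
function isSingleDocumentPage(root) {
  return countReceiptBlocks(root) === 1;
}
```

In a content script, `root` would typically be `document`; any object that exposes `querySelectorAll` and whose nodes expose `querySelector` works the same way.
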

In both district and appellate courts, the document ID is often not directly accessible within the HTML structure of the page. While some courts use the document ID as the entry number, this is not a consistent practice across all jurisdictions. To address this challenge, this PR introduces two helper methods that use the URL of the PACER page and the existing DocToCases mapping stored in our local storage:
- District court URLs frequently contain a query parameter named exclude_attachments. This parameter is a comma-separated list of shortened document IDs that are not included in the combined PDF. By parsing this list and comparing it to the DocToCases mapping, we can identify the missing document ID.
  
  This PR introduces the getPacerDocIdFromExcludeList helper function. It takes a list of excluded document IDs as input and returns the corresponding document ID based on the DocToCases mapping.
- Appellate court URLs often include a query parameter named dls. This parameter is a comma-separated list of shortened document IDs that are included in the combined PDF. By filtering the DocToCases mapping based on this list, we can determine the document ID.
  
  The getPacerDocIdFromPartialId method implements this filtering process, taking the partial as input and returning the extracted document ID.
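
The two URL-based lookups can be illustrated with a small sketch. It is not the implementation of `getPacerDocIdFromExcludeList` or `getPacerDocIdFromPartialId`; the function names, their signatures, the shape of the mapping (full document ID to case ID), the case-ID scoping, and the rule that a shortened ID matches the end of the full ID are all assumptions made for the example.

```javascript
// Assumption: a shortened ID matches the tail of the full document ID.
function matchesShortId(fullId, shortId) {
  return fullId.endsWith(shortId);
}

// Read a comma-separated query parameter as a list of trimmed values.
function readIdList(url, paramName) {
  const raw = new URL(url).searchParams.get(paramName);
  if (raw === null) return null;
  return raw
    .split(',')
    .map((id) => id.trim())
    .filter(Boolean);
}

// District sketch: exclude_attachments lists the documents that are NOT in
// the combined PDF, so the one known document of the case that is not
// excluded is the one being viewed. Scoping by caseId is an assumption.
function docIdFromExcludeList(url, docsToCases, caseId) {
  const excluded = readIdList(url, 'exclude_attachments');
  if (excluded === null) return null;
  const remaining = Object.keys(docsToCases).filter(
    (docId) =>
      docsToCases[docId] === caseId &&
      !excluded.some((shortId) => matchesShortId(docId, shortId))
  );
  return remaining.length === 1 ? remaining[0] : null;
}

// Appellate sketch: dls lists the documents that ARE in the combined PDF,
// so the mapping is filtered down to the IDs that match the list.
function docIdFromIncludeList(url, docsToCases) {
  const included = readIdList(url, 'dls');
  if (included === null) return null;
  const matches = Object.keys(docsToCases).filter((docId) =>
    included.some((shortId) => matchesShortId(docId, shortId))
  );
  return matches.length === 1 ? matches[0] : null;
}
```

Both functions return `null` when the parameter is missing or when the mapping does not narrow the answer to exactly one document, so a caller can fall back to showing the warning instead of guessing.
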

Introduces a new utility function, parseDataFromReceiptTable, to extract data from receipt tables in appellate courts. While parsing the title alone is often enough for single-document pages, it lacks the necessary information to identify the document in multi-document pages. To address this limitation, this function extracts data directly from the receipt table, providing a more reliable and comprehensive approach.

Integrates all helper functions into the handleCombinedPdfPageView (appellate) and handleCombinedPDFView (district) methods. This will enable us to insert banners for available documents and upload the PDFs to the recap archive.

Here are GIFs showing how our extension works in appellate and district courts:

District Court:

Appellate Court:

This commit introduces a new utility function to efficiently extract data from receipt tables, addressing the limitation of multi-document pages. This enhancement improves the extension's ability to accurately process documents.

ERosendo · 2024-10-01T23:20:10Z

@mlissner in my last commit, I implemented MIME type validation to prevent the upload of invalid file formats. During testing, I encountered an issue with certain district courts (for example, case 2:24-mj-00100), where downloading a single document from a multi-document page seemed to be restricted. In both Chrome and Firefox, with and without extensions, I consistently received the error message: `Cannot redisplay /tmp/1727589-2--109361.pdf, it has already been shown once.` While some courts' tips and tricks pages suggest it might be a Chrome-related issue, my testing indicated that the error was not browser-specific.

Upon further investigation, I discovered that the extension was sending the HTML page containing the error message to the CL API (not great). By implementing the validation, we can prevent the upload of the invalid HTML content.
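
As a rough illustration of that validation (not the code added in the commit), the check can be as small as the sketch below. `isPdfBlob` and `uploadIfPdf` are hypothetical names, and `upload` stands in for whatever function actually sends the file to the CL API.

```javascript
// Returns true only when the payload reports a PDF MIME type.
// `blob` is anything exposing a `type` string, such as a Blob or File.
function isPdfBlob(blob) {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mimeType = (blob.type || '').split(';')[0].trim().toLowerCase();
  return mimeType === 'application/pdf';
}

// Hypothetical wrapper: skip the upload when the payload is not a PDF,
// for example when PACER answers with an HTML error page.
function uploadIfPdf(blob, upload) {
  if (!isPdfBlob(blob)) {
    return false;
  }
  upload(blob);
  return true;
}
```

With this guard, the HTML page carrying the "Cannot redisplay" message would be rejected before any request is made, because its reported type is `text/html` rather than `application/pdf`.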

Here are gifs showing the error message in different browsers:

Chrome:

Firefox:

Safari:

elisa-a-v

The descriptive names for variables and methods, as well as the comprehensive comments explaining things are very much appreciated and quite helpful, thank you! In general the code LGTM with just a couple of minor things related to duplicated code.

While testing I did notice a few bugs:

1. This docket's fourth entry has two attachments, with only the second one available in RECAP. When navigating to PACER to buy the first attachment, the extension displayed the banner indicating that it was already available in RECAP, but the link was actually for the second attachment, and the document displayed had two different headers superimposed on each other.
2. Similar to the previous bug, after buying attachment 5 in the first doc in this docket, I tried to buy attachment 6, and I saw the banner indicating the document was available. When I tried to buy it anyway, the PACER site displayed attachment 5 instead of 6, and nothing was uploaded to RECAP.
3. When buying doc number 1208679954 attachment 2 for this docket (appellate), I could see the document in PACER and the extension displayed the notification that the document was successfully uploaded to RECAP, but the document was not available and the processing queue displayed an error. The document was finally uploaded successfully by a later processing queue (actually two PQs were successful: this and this) and is now available in RECAP.
4. When buying attachment 1 in the same doc and same docket as the previous bug, it was first uploaded successfully but without an attachment number, so the RECAP page displayed it as a main document. A few minutes later, upon refreshing the docket page, the document was gone again. I think this was the processing queue that uploaded it, which has a successful status but has no RECAP document.
5. This one might be PACER's fault, but on my first try doing anything with this docket, I tried buying attachment 2 in doc number 1 and got the error message `Cannot redisplay /tmp/9213722-2--117083.pdf, it has already been shown once.` The document is not available in RECAP and I cannot buy it from PACER. The same happens with the rest of the 1-page attachments. (Yeah, you already described this in your last comment, I'm sorry I missed that!! 🤦🏽) I am not getting any errors from the extension with this issue.

Considering this PR already introduces a fair amount of changes, I'm not sure numbers 3 and 4 should be addressed here, but I'll leave that up to ye @mlissner and @ERosendo

elisa-a-v

Oops I accidentally sent the last review before I checked the request changes option, so I'm requesting changes now 😅

mlissner · 2024-12-11T00:45:32Z

My vote is that if the bugs you found aren't coming from this code, they should get filed and analyzed elsewhere, and we should get this merged now. If they're part of this code, then we should make sure we address them (or choose not to).

elisa-a-v · 2024-12-11T16:50:41Z

> if the bugs you found aren't coming from this code, they should get filed and analyzed elsewhere

That makes sense to me, but the behavior cannot be reproduced in the live extension: these are all attachments in appellate courts, so we always get the extension banner warning that the doc cannot be uploaded. I'm therefore not entirely sure whether this has to do with the extension or not.

I understand we have identified other issues regarding appellate courts, I wonder if these are more closely related to those issues rather than this one?

I think at least 3 and 4 are probably a CL issue with the processing queues and/or related tasks, more so than the extension itself, don't you think?

mlissner · 2024-12-11T17:35:48Z

I'm going to duck out of having an opinion on these, but I trust that you and Eduardo can puzzle these out. Sorry! These are always tricky issues. :)

ERosendo · 2024-12-11T21:55:48Z

@elisa-a-v Thanks for the thorough review!! I have addressed your suggestions by refactoring the duplicated code and fixing bugs 1 and 4. Unfortunately, I couldn't reproduce bugs 2 and 3.

Bug 1 was caused by a slight difference in how ACMS identifies documents compared to CM/ECF district and appellate courts. My initial approach used the same logic for all three, but commit d1622c0 adds an extra step to ensure the banner is displayed correctly for ACMS documents.

On the other hand, bug 4 occurred because the extension didn't send the attachment number during the document upload process. I believe CourtListener removed your upload after it processed an attachment page upload for that entry. Entries with attachments from appellate courts don't have main documents, so the merger code updated them accordingly. Commit 864e4cd introduces a new mapping similar to docsToCases, allowing us to retrieve the attachment number from the download page when it's missing from the HTML code.
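
The fallback described for bug 4 can be pictured with a short sketch. It is only an illustration under assumptions: `rememberAttachmentNumbers`, `resolveAttachmentNumber`, and the shape of the mapping (document ID to attachment number) are hypothetical and are not taken from commit 864e4cd.

```javascript
// While an attachment page is processed, record the attachment number of
// each linked document. Returns a new mapping and leaves the input untouched.
function rememberAttachmentNumbers(mapping, links) {
  const updated = { ...mapping };
  for (const { docId, attachmentNumber } of links) {
    updated[docId] = attachmentNumber;
  }
  return updated;
}

// On the download page, prefer the number found in the HTML and fall back
// to the stored mapping when the page does not provide one.
function resolveAttachmentNumber(numberFromPage, docId, mapping) {
  if (numberFromPage !== null && numberFromPage !== undefined) {
    return numberFromPage;
  }
  return Object.prototype.hasOwnProperty.call(mapping, docId)
    ? mapping[docId]
    : null;
}
```

Sending the resolved number with the upload is what would keep the document from being stored as a main document and later replaced when the attachment page is merged.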

elisa-a-v

LGTM!

mlissner · 2024-12-17T22:59:53Z

There it is!

mlissner · 2024-12-17T22:59:59Z

Mash that merge button, Elisa!

elisa-a-v · 2024-12-17T23:10:45Z

Oh no! Is this a known issue? (the failed job)

mlissner · 2024-12-17T23:46:50Z

Don't think so. Guess we better file it and figure it out.

elisa-a-v · 2024-12-18T21:00:43Z

I re-ran the failed job and it was now successful. Should we still look into it?

mlissner · 2024-12-18T23:20:38Z

Let's see if it fails a few more times first, thanks.

ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 7 times, most recently from c562279 to 6ba914a October 1, 2024 11:58

ERosendo added 2 commits October 1, 2024 08:30

feat(appellate): Refine multi-document page handling logic

2277d9f

feat(district): Refine multi-document page handling logic

ea84011

ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch 5 times, most recently from 8e2c8ed to e679a22 October 1, 2024 19:54

ERosendo added 11 commits October 1, 2024 18:54

feat(docs): Changelog Update

4d18676

feat(utils): Adds helper method to get pacer doc ids using exclude lists

c241760

feat(district): Adds helper function to check document availability

4a8db9b

This commit introduces a helper function that encapsulates logic to check if a specific document within a combined PDF page is available in the recap archive.

feat(appellate): Tweaks the findDocLinksFromAnchors method

3ba7123

Ensures that the `docsToCases` mapping is correctly populated when processing attachment pages.

feat(utils): Introduces getDocToCasesMapping helper function

16d580c

Adds a new utility function to retrieve the `DocToCases` mapping from storage

feat(utils): Adds helper method to get docId using shortened version

affc549

feat(appellate): Adds helper function to check document availability

6ed14c5

Introduces a new function to determine if a particular document within a multi-doc page is available in the recap archive.

feat(district): Adds logic to upload file from multi-doc page

1a1656a

feat(appellate): Adds logic to upload file from multi-doc page

43b6a8a

feat(pdf_upload): Add MIME type validation to ensure data integrity

dbb4b31

ERosendo force-pushed the 349-feat-identify-multidoc-pages-with-one-doc branch from e679a22 to dbb4b31 October 1, 2024 22:54

ERosendo marked this pull request as ready for review October 1, 2024 22:54

ERosendo requested a review from mlissner October 1, 2024 22:55

mlissner assigned elisa-a-v Nov 12, 2024

mlissner requested a review from elisa-a-v November 12, 2024 22:25

ERosendo mentioned this pull request Nov 13, 2024

Incorrectly identified split pages freelawproject/recap#349

Closed

Merge branch 'main' into 349-feat-identify-multidoc-pages-with-one-doc

948ab40

elisa-a-v reviewed Dec 10, 2024

src/content_delegate.js (outdated, resolved)

src/appellate/appellate.js (outdated, resolved)

elisa-a-v self-requested a review December 10, 2024 22:22

elisa-a-v requested changes Dec 10, 2024

mlissner assigned ERosendo and unassigned elisa-a-v Dec 11, 2024

fix(acms): Tweaks logic to display banner for available documents

d1622c0

ERosendo added 4 commits December 11, 2024 15:55

feat(appellate): Introduces document-to-attachment-number mapping

864e4cd

feat(appellate): Refines logic to handle missing attachment numbers

7219201

refactor(appellate): Extract logic to create button for filers

7d775ce

feat(pacer): Adds helper to check single docs in combined PDF pages

780864a

ERosendo requested a review from elisa-a-v December 11, 2024 21:56

ERosendo assigned elisa-a-v and unassigned ERosendo Dec 11, 2024

elisa-a-v approved these changes Dec 17, 2024

elisa-a-v merged commit c218c7d into main Dec 17, 2024
8 checks passed

elisa-a-v deleted the 349-feat-identify-multidoc-pages-with-one-doc branch December 17, 2024 23:05
