TST: Centralize file downloads #2324
@pubpub-zz @stefan6419846 @exiledkingcc @MasterOdin What do you think about it?
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

Coverage Diff (main vs. #2324)

| | main | #2324 |
|---|---|---|
| Coverage | 94.37% | 94.37% |
| Files | 43 | 43 |
| Lines | 7660 | 7660 |
| Branches | 1515 | 1515 |
| Hits | 7229 | 7229 |
| Misses | 267 | 267 |
| Partials | 164 | 164 |

☔ View full report in Codecov by Sentry.
The idea sounds good.
Thanks for the proposal. While I think that this simplifies execution, it does not really resolve the network issues and (depending on the execution speed) might even run into throttling/timeouts faster. Further things that came to mind:
It feels like there are some pieces missing here. As @stefan6419846 stated, how does the CSV solve these two goals:
Unless you commit the PDFs into the repo itself, the network will always be involved at some point. Furthermore, it currently looks like the tests work the same way with or without the CSV: if a file does not exist locally when a test runs, it is downloaded as part of test execution, and future runs use the cached file. I'm also more in the camp that the PDF test files should be vendored into the repo. It makes working with the repo a bit more annoying because cloning takes longer, but in the long run it simplifies life. PDF.js is prior art here: they have 1000+ PDFs in their repo and it's working fine for them.
Is the plan to have a pre-test step that downloads all files at once, and then kicks off the test runner, so that the
Some other thoughts on other points:
Who and why? Unless there's a known use case right now, I would not base the decision of how to format this file on something nebulous, as it's equally (maybe more so?) possible that no one else uses this CSV, and you make some decision that makes pypdf life more annoying for nothing. There's also a good chance that when some other project does want to re-use this CSV, they still need some change in how it works due to their own requirements.
Not sure why you wouldn't remove the url from the argument altogether. The name is the identifier that ties it to the URL via this proposed file, so having the URL at all introduces pointless redundancy, which kind of violates your 4th stated goal of doing this.
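A minimal sketch of what a name-only lookup could look like, assuming the proposed file has already been parsed into a list of mappings. The field names `local_filename` and `url`, the sample entries, and the helpers `build_index` and `url_for` are hypothetical and are not taken from the PR:

```python
from typing import Dict, List

# Hypothetical stand-in for the parsed contents of tests/example_files.yaml.
# The field names and the example.com URLs are assumptions for illustration.
EXAMPLE_FILES: List[Dict[str, str]] = [
    {"local_filename": "a.pdf", "url": "https://example.com/files/a.pdf"},
    {"local_filename": "b.pdf", "url": "https://example.com/files/b.pdf"},
]


def build_index(entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each local file name to its download URL."""
    return {entry["local_filename"]: entry["url"] for entry in entries}


def url_for(name: str, index: Dict[str, str]) -> str:
    """Return the URL registered for a name; fail loudly if it is unknown."""
    try:
        return index[name]
    except KeyError:
        raise KeyError(f"{name!r} is not listed in the example files") from None
```

With a lookup like this, a test would only need to pass the name, and the URL would live in exactly one place.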
I agree with this, though I feel much more strongly that CSV is a very bad format due to its lack of standardization, and that JSON/YAML should be used instead. In my experience, if you add any sort of complexity to your document, it ends up breaking in dumb ways. Not to mention the differences in how different OSes/programs open this format and end up breaking it. Stick to formats that have a properly standardized serialization/deserialization (be it JSON or YAML).
This project has
Agreed. Abusing
Agreed, I don't think the file should contain a list of tests that use the file, as it just duplicates info that's easy enough to grep for. If it's hard to find the name of a file in the codebase, that just means the filename is poor and not unique enough, and the fix is to change the filename.
CI currently calls the download method which is used in this PR before running the actual tests: pypdf/.github/workflows/github-ci.yaml Line 47 in 40e25ec
This depends on how you see it, how distribution is done etc. The first thing is that most test PDFs have no clean copyright status. If included inside the sdist for example, this makes license compliance hard and bloats releases.
Hey, thank you all for the feedback :-) I'll try to group it :-)
File format
I was using CSV for two reasons:
However, I don't have strong feelings about it. I'll change it to YAML as you seem to prefer it. I do like the option to add comments.
Additional information in the example_files.yaml
Why does the PR give a speed-up / which problem does it solve?
We can download files in parallel: 1f5ed08
Yes, it does not fix the issue that we might get network problems. But then the network problems occur in one place only. The failure will only be in the part of the tests that deals with downloading the files, not somewhere random in the code. That makes it easier to see where the issue comes from. We could potentially even run it as its own CI step, so that all other parts would only start once all files are downloaded properly.
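The general idea of a parallel, centralized download step can be sketched roughly as follows. This is not the code from the commit; `download_all`, `fetch`, the `fetcher` parameter, and the worker count are hypothetical names and values chosen for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Default fetcher: read the whole response body."""
    with urlopen(url) as response:
        return response.read()


def download_all(
    files: Dict[str, str],
    cache_dir: Path,
    fetcher: Callable[[str], bytes] = fetch,
    max_workers: int = 8,
) -> List[str]:
    """Download every missing file in parallel; return error messages."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    def download_one(item: Tuple[str, str]) -> Optional[str]:
        name, url = item
        target = cache_dir / name
        if target.exists():
            return None  # already cached by an earlier run
        try:
            target.write_bytes(fetcher(url))
        except Exception as exc:  # collect the failure instead of aborting
            return f"{name}: {exc}"
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(download_one, files.items()))
    return [message for message in results if message is not None]
```

Because all failures are collected in one list, a separate CI step could fail early with a single readable report when that list is non-empty.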
@MasterOdin We already do that: https://github.com/py-pdf/pypdf/blob/main/.github/workflows/github-ci.yaml#L45-L47
Adding PDFs directly to the repo
It is true that this would be the simplest solution, but it has two massive disadvantages:
While I'm the maintainer, I will not change this 3-split:
PR completeness
@MasterOdin This is just an intermediate step. I want to get there, but I don't have the time to go over the ~300 lines and do it. Merging those changes step-by-step + having a discussion beforehand + ensuring that new PRs use the
I want this PR to contain the complicated parts / have an agreed structure. The single lines in the tests can be done in other PRs (maybe even by new contributors? 🤞 )
## What's new

### Bug Fixes (BUG)
- Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
- Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
- check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
- Relax flate decoding for too many lookup values (#2331) by @stefan6419846
- Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
- Note in reading form data (#2338) by @MartinThoma
- Pull Request prefixes and size by @MartinThoma
- Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
- Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
- Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
- Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
- Fix typo "steam" → "stream" (#2327) by @stefan6419846
- Run black by @MartinThoma
- Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)
This PR introduces a new `tests/example_files.yaml` in which we have the local filename as well as the URL from which we download the document.
The idea is that we move all download URLs to that document. Then we just reference the local file names. The advantages of this approach are:
`name` for different URLs we could have flaky tests as well. That is currently super hard to detect and would be rather easy to prevent in future
Meta
This is only a part of what we would need to do. It's a basis for discussion. What is missing:
`get_data_from_url`: name first, without a default. Maybe even removing the URL part completely.
I'll leave it open until 9th of December so that people can comment. Alternatively, if I get a thumbs-up from two of the main contributors and no thumbs-down, I'd also merge earlier :-)
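One of the advantages mentioned in the description is that reusing a `name` for different URLs becomes easy to detect once all entries live in one file. A minimal sketch of such a check, assuming the file has been parsed into a list of mappings; the field names `local_filename` and `url` and the helper `find_conflicting_names` are hypothetical:

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set


def find_conflicting_names(entries: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return every local file name that is registered with more than one URL."""
    urls_by_name: DefaultDict[str, Set[str]] = defaultdict(set)
    for entry in entries:
        urls_by_name[entry["local_filename"]].add(entry["url"])
    return {name: urls for name, urls in urls_by_name.items() if len(urls) > 1}
```

A check like this could run as an ordinary test, so that a conflicting entry fails fast instead of surfacing as a flaky test later.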