
TST: Centralize file downloads #2324

Merged: 10 commits from centralize-file-downloads into main on Dec 9, 2023

Conversation

MartinThoma
Member

@MartinThoma MartinThoma commented Dec 2, 2023

This PR introduces a new tests/example_files.yaml in which we have the local filename as well as the URL from which we download the document.

The idea is that we move all download URLs to that central file. Then we just reference the local file names. The advantages of this approach are:

  • Detection of name clashes: If we use the same name for different URLs, we could get flaky tests. That is currently super hard to detect and would be rather easy to prevent in the future.
  • No test needs the network: The network is a source of flaky tests, and that's hard to notice. With centralized downloads, a failing download is easy to spot.
  • Efficiency: We can heavily parallelize downloads. That might lead to speed-ups.
  • De-duplication: We might have the same file under different file names in the code base. Having all URLs in one place makes it easier to ensure we download everything just once.
  • Shorter Python code: That's a tiny stylistic argument. I prefer not having super long lines / weird line breaks just because a URL is so long.
  • Cross-project use: Other projects might be interested in this file and use it for their tests.
  • Internal re-use: It's easier to loop over the existing files like this. That way, we can potentially use one PDF for multiple types of tests.
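The name-clash detection mentioned above is straightforward once all entries live in one file. A minimal sketch, assuming a registry loaded as a list of mappings (the field names `local_filename` and `url` are assumptions, not the PR's actual format):

```python
# Sketch of a central file registry and a name-clash check.
# Field names and the registry layout are assumptions, not the PR's code.

def find_name_clashes(entries):
    """Return local names that are mapped to more than one distinct URL."""
    urls_per_name = {}
    for entry in entries:
        urls_per_name.setdefault(entry["local_filename"], set()).add(entry["url"])
    return sorted(name for name, urls in urls_per_name.items() if len(urls) > 1)

# Example registry, as it might look after loading tests/example_files.yaml:
entries = [
    {"local_filename": "crazyones.pdf", "url": "https://example.com/a.pdf"},
    {"local_filename": "crazyones.pdf", "url": "https://example.com/b.pdf"},
    {"local_filename": "attachment.pdf", "url": "https://example.com/c.pdf"},
]
print(find_name_clashes(entries))  # ['crazyones.pdf']
```

A check like this could run as a single test over the registry, so a clash fails loudly instead of surfacing as flaky behavior elsewhere.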

Meta

This is only a part of what we would need to do. It's a basis for discussion. What is missing:

  • Moving all download URLs to the CSV
  • Removing the URLs from the rest of the code
  • Switching the arguments of get_data_from_url: name first, without a default. Maybe even removing the URL part completely.
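The reworked helper described in the last bullet could look roughly like this: name first and mandatory, with the URL looked up in the central registry instead of being passed by every caller. All names here are assumptions, not the PR's final API:

```python
# Hedged sketch of a name-first get_data_from_url; the signature and the
# registry shape (name -> URL) are assumptions, not the PR's actual code.
from pathlib import Path
from urllib.request import urlopen

def get_data_from_url(name: str, registry: dict,
                      cache_dir: Path = Path("tests/pdf_cache")) -> bytes:
    """Return the file's bytes, downloading and caching it on first use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / name
    if not cached.exists():
        # Only consult the registry on a cache miss.
        with urlopen(registry[name]) as response:
            cached.write_bytes(response.read())
    return cached.read_bytes()
```

Once the cache is warm (e.g. after a pre-test download step), the registry lookup is never hit, so tests themselves stay off the network.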

I'll leave it open until the 9th of December so that people can comment. Alternatively, if I get a thumbs-up from two of the main contributors and no thumbs-down, I'd also merge earlier :-)

@MartinThoma
Member Author

@pubpub-zz @stefan6419846 @exiledkingcc @MasterOdin What do you think about it?

@MartinThoma MartinThoma requested a review from pubpub-zz December 2, 2023 09:14

codecov bot commented Dec 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8d3f879) 94.37% compared to head (35b1ac7) 94.37%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2324   +/-   ##
=======================================
  Coverage   94.37%   94.37%           
=======================================
  Files          43       43           
  Lines        7660     7660           
  Branches     1515     1515           
=======================================
  Hits         7229     7229           
  Misses        267      267           
  Partials      164      164           


@pubpub-zz
Collaborator

The idea sounds good. A few suggestions that could help:

  • Use JSON instead of CSV to ease readability of the file list; a `pytest -k to_csv_output` helper could be proposed to export it as CSV for those who want to review the list in a spreadsheet.
  • Add two extra fields (only provided as comments):
    a) the test file where the file is used (with a function to update it automatically, e.g. `pytest -k update_source`);
    b) an optional comment giving an idea of what the file is used to test, or a `pytest -k test_file --pdf used_file.pdf` invocation which would list function names and documentation.
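The CSV export idea could also live in a small standalone script rather than a pytest hook. A minimal sketch, where the field names are assumptions:

```python
# Sketch of a standalone JSON-to-CSV export for spreadsheet review.
# The field names (local_filename, url, comment) are assumptions.
import csv
import io
import json

def export_to_csv(json_text: str) -> str:
    """Render a JSON list of file entries as CSV text."""
    entries = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["local_filename", "url", "comment"])
    writer.writeheader()
    for entry in entries:
        # Missing optional fields become empty cells.
        writer.writerow({k: entry.get(k, "") for k in writer.fieldnames})
    return out.getvalue()

sample = '[{"local_filename": "a.pdf", "url": "https://example.com/a.pdf"}]'
print(export_to_csv(sample))  # prints a header row plus one data row
```

Keeping this out of the registry format itself means the source of truth stays one file, while a spreadsheet view is always one command away.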

@stefan6419846
Collaborator

Thanks for the proposal. While I think this simplifies execution, it does not really resolve the network issues and (depending on the execution speed) might even run into throttling/timeouts faster.

Further things which came into my mind:

  • I generally prefer YAML due to its readability. Unfortunately, YAML does not yet have stdlib support. YAML and JSON are quite similar (YAML is a superset of JSON) and might require escaping some characters, while CSV files should not require escaping in most cases (only if the data contains commas), which makes them easier to work with. For this reason, I would probably stay with CSV for now.
  • I do not think that we should "abuse" the pytest -k test filtering switch to do development tasks. This should be covered by a dedicated script if actually required.
  • I do not think that we should start implementing our own plugin or our own flags for pytest for convenience functions to generate/show metadata that most developers will not even need. Just searching/grepping for a filename inside the source should be sufficient to determine where a test file is being used.

@MasterOdin
Member

It feels like there are some pieces missing here. As @stefan6419846 stated, how does the CSV solve these two goals:

  • No test needs the network: The network is a source of flaky tests, and that's hard to notice. With centralized downloads, a failing download is easy to spot.

Unless you commit the PDFs into the repo itself, the network will always be involved at some point. Furthermore, it currently looks like the tests work the same way here: when the test runner executes a test and the file does not exist locally, it is downloaded as part of test execution. Future runs of the test use the cached file, and this is true with or without this CSV. I'm also more in the camp that the PDF test files should be vendored into the repo: while it makes working with the repo a bit more annoying due to longer clone times, in the long run it simplifies life. PDF.js is prior art here; they have 1000+ PDFs in their repo and it works fine for them.

  • Efficiency: We can heavily parallelize downloads. That might lead to speed-ups.

Is the plan to have a pre-test step that downloads all files at once, and then kicks off the test runner, so that the get_data_from_url function always has a cache hit?

Some other thoughts on other points:

Cross-project use: Other projects might be interested in that CSV and use it for their tests.

Who and why? Unless there's a known use case right now, I would not shape the decision of how to format this file around something nebulous, as it's equally (maybe more?) likely that no one else uses this CSV and you end up making a decision that makes pypdf life more annoying for nothing. There's also a good chance that when some other project does want to re-use this CSV, they will still need some change in how it works due to their own requirements.

Switching the arguments of get_data_from_url: name first, without a default. Maybe even removing the URL part completely.

Not sure why you wouldn't remove the URL from the argument altogether. The name is the identifier that ties it to the URL via this proposed file, so having the URL at all introduces pointless redundancy, which kind of violates your fourth stated goal.

I generally prefer YAML due to the readability.

I agree with this, though I hold the even stronger belief that CSV is a very bad format due to its lack of standardization, and that we should use JSON/YAML over it. In my experience, if you add any sort of complexity to a CSV document, it ends up breaking in dumb ways. Not to mention the differences in how various OSes/programs open the format and end up breaking it. Stick to formats that have a proper serialization/deserialization standard (be it JSON or YAML).

Unfortunately, YAML does not yet have stdlib support. YAML and JSON are quite similar (YAML is a superset of JSON) and might require escaping some characters, while CSV files should not require escaping in most cases (only if the data contains commas), which makes them easier to work with. For this reason, I would probably stay with CSV for now.

This project has pre-commit as a dev dependency, and that has pyyaml as a dependency, so if you have set up this project locally and are in a position to run the tests (having installed the dev dependencies), then pyyaml is available. I think we can just make pyyaml a top-level dev dependency and use it. Even if it weren't available, I don't think we should be afraid to add a new dev dependency, or stick with a format just for that reason.

  • I do not think that we should "abuse" the pytest -k test filtering switch to do development tasks. This should be covered by a dedicated script if actually required.

Agreed. Abusing pytest as a general script runner is not good usage, and I don't think any developer would think to look there first for that kind of thing. There's already a Makefile; just use that for these sorts of dev scripts.

  • I do not think that we should start implementing an own plugin or own flags for pytest for convenience functions to generate/show metadata most developers will not even need. Just searching/grepping for a filename inside the source should be sufficient to determine where a test file is being used.

Agreed, I don't think the file should contain a list of tests that use each file, as it just duplicates info that's easy enough to grep for. If it's hard to find the name of a file in the codebase, that just means the filename is poor and not unique enough, and the filename should be changed.

@stefan6419846
Collaborator

stefan6419846 commented Dec 2, 2023

Is the plan to have a pre-test step that downloads all files at once, and then kicks off the test runner, so that the get_data_from_url function always has a cache hit?

CI currently calls the download method which is used in this PR before running the actual tests:

python -c "from tests import download_test_pdfs; download_test_pdfs()"

I'm also more in the camp that the PDF test files should be vendored into the repo

This depends on how you see it, how distribution is done, etc. The first issue is that most test PDFs have no clean copyright status. If they were included inside the sdist, for example, this would make license compliance hard and bloat releases. pdf.js does not vendor their PDF files inside releases, and I am not sure how they deal with the copyright aspects in general.

@MartinThoma
Member Author

MartinThoma commented Dec 3, 2023

Hey, thank you all for the feedback :-)

I'll try to group it :-)

File format

I was using CSV for two reasons:

  1. It's minimal
  2. Github displays it in a nice way: https://github.com/py-pdf/pypdf/blob/1f5ed085ff756eb879e3a49bb296b60a269b523c/tests/example_files.csv

However, I don't have strong feelings about it. I'll change it to YAML as you seem to prefer it. I do like the option to add comments.

Additional information in the example_files.yaml

  • We can / should add comments
  • I don't think we should add where it's used. That can easily be found by searching for the name (and we should use names that are long enough to not lead to many false-positives)
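A commented entry in example_files.yaml might look like the following sketch (the layout, field names, and the example URL are assumptions, not the merged format):

```yaml
# Sketch only: field names and values are hypothetical.
- local_filename: outline-test.pdf
  url: https://example.com/some-report.pdf
  # Comment: exercises outline extraction; YAML allows comments, JSON does not.
```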

Why does the PR give a speed-up / which problem does it solve?

We can download files in parallel: 1f5ed08

Yes, it does not fix the issue that we might get network problems. But then the network problems occur in one place only. The failure will be only in the part of the tests that deals with downloading the files, not somewhere random in the code. That makes it easier to see where the issue comes from.

We could potentially even run it as its own CI step - all other parts would only start once all files are downloaded properly.
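The parallel download step could be sketched with a thread pool; the helper names and the registry shape are assumptions, not the commit's actual code:

```python
# Sketch of parallelized test-file downloads; names are assumptions.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen

def download_one(name: str, url: str, cache_dir: Path) -> Path:
    """Download a single file into the cache unless it is already there."""
    target = cache_dir / name
    if not target.exists():
        with urlopen(url) as response:
            target.write_bytes(response.read())
    return target

def download_all(registry: dict, cache_dir: Path, workers: int = 8) -> list:
    """Fetch every registry entry concurrently; fail here, in one place."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download_one, name, url, cache_dir)
                   for name, url in registry.items()]
        # .result() re-raises any download error in this pre-test step,
        # instead of letting it surface somewhere random in the suite.
        return [future.result() for future in futures]
```

Run as its own CI step, any network failure stops the pipeline before the tests start, which matches the "one place only" argument above.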

Is the plan to have a pre-test step that downloads all files at once, and then kicks off the test runner, so that the get_data_from_url function always has a cache hit?

@MasterOdin We already do that: https://github.com/py-pdf/pypdf/blob/main/.github/workflows/github-ci.yaml#L45-L47
It's quite a bit of tedious work to add all files in there, but that is where I hope to go :-)

Adding PDFs directly to the repo

It is true that this would be the simplest solution, but it has two massive disadvantages:

  1. Legal issues: I will not add any PDF where I don't know if I'm allowed to do so.
  2. Cloning size: I will not make the pypdf repository huge. I want to enable people to get the latest main with as little effort as possible. Hence I've created the sample-files repository.

While I'm the maintainer, I will not change this 3-split:

  • PDFs in pypdf: Very few which are rather small. Just for very basic testing
  • sample-files: Files for which we have the permission
  • third-party download: All the rest. When a file is publicly available on the internet, I assume that downloading it on demand is fine, but adding it permanently / hosting it is a very different story. I would like to get rid of this category, but I don't see that happening.

PR completeness

Switching the arguments of get_data_from_url: name first, without a default. Maybe even removing the URL part completely.

Not sure why you wouldn't remove the url from the argument altogether.

@MasterOdin This is just an intermediate step. I want to get there, but I don't have the time to go over the ~300 lines and do it. Merging those changes step-by-step + having a discussion beforehand + ensuring that new PRs use the example_files.yaml makes it easier :-)

I want this PR to contain the complicated parts / have an agreed structure. The single lines in the tests can be done in other PRs (maybe even by new contributors? 🤞 )

@MartinThoma MartinThoma force-pushed the centralize-file-downloads branch from 5c4ef6f to a5d7e4e on December 3, 2023 15:50
@MartinThoma MartinThoma force-pushed the centralize-file-downloads branch from a5d7e4e to 003f2c6 on December 3, 2023 16:20
@MartinThoma MartinThoma merged commit 4aae547 into main Dec 9, 2023
14 checks passed
@MartinThoma MartinThoma deleted the centralize-file-downloads branch December 9, 2023 11:54
MartinThoma added a commit that referenced this pull request Dec 10, 2023
## What's new

### Bug Fixes (BUG)
-  Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
-  Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
-  check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
-  Relax flate decoding for too many lookup values (#2331) by @stefan6419846
-  Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
-  Note in reading form data (#2338) by @MartinThoma
-  Pull Request prefixes and size by @MartinThoma
-  Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
-  Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
-  Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
-  Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
-  Fix typo "steam" → "stream" (#2327) by @stefan6419846
-  Run black by @MartinThoma
-  Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)