-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All Internet Archive links, used in reference testing, are broken #8920
Comments
It looks like we can just update all links to add This can be changed to: to get the direct link. I have tested this with a number of linked test cases and it worked for all of them. |
I've noticed that as well, but I'm not sure if we really want to assume that the format will be constant in the future (it could potentially even depend on e.g. load balancing or similar things). I'd hate for us to do search-and-replace on all |
Correct, but doesn't the same apply for HTML parsing? If the markup changes, we may get the same problem. Personally I'm fine with either solution by the way, because I'm really on the fence about which one is faster/more future-proof. |
There is also fallback code at https://github.com/mozilla/pdf.js/blob/master/test/downloadutils.js#L40 which I assume does not work anymore as well. I think we should remove that because silently handling errors does not sound right in the context of testing. I'd rather have things fail loudly so we know there is an error and can fix it properly. Since we have Internet Archive-specific code there anyway, I think I'm fine with handling this over there. Let's try to consolidate the code in one block as much as possible. |
Just FYI: I attempted to implement something along those lines in PR #7947, but the resulting discussion indicated that failing hard wasn't really desirable behaviour here. |
I have one more idea for this. It's a bit of a hybrid approach for the two solutions. How about in |
@timvandermeij #8920 (comment) seems like a good compromise for now! Time permitting, do you want to take a stab at implementing it (since the current state isn't great when setting up a dev-environment)? |
Yes, I'm hoping I can take a look at this before or during the weekend. |
Apparently the Internet Archive, which we depend on for a very large number of (linked) reference test-cases, has recently changed how they serve PDF files.
Previously, a URL such as http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About%20Stacks.pdf would return a PDF file directly. However, now a HTML file is returned instead (which then points to the actual PDF file).
For someone cloning the PDF.js repo, and attempting to set-up testing for the first time, this means that all linked test-cases will now fail. Furthermore, it also means that we cannot use the Internet Archive when adding new test-cases.
Since the HTML file returned does contain a direct link to the PDF file, embedded in an
<iframe>
tag, we could perhaps add special-casing for Internet Archive URLs in test/downloadutils.js, such that the HTML file is first downloaded and parsed to obtain a direct PDF link.The text was updated successfully, but these errors were encountered: