Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All Internet Archive links, used in reference testing, are broken #8920

Closed
Snuffleupagus opened this issue Sep 17, 2017 · 8 comments
Closed

Comments

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Sep 17, 2017

Apparently the Internet Archive, which we depend on for a very large number of (linked) reference test-cases, has recently changed how they serve PDF files.

Previously, a URL such as http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About%20Stacks.pdf would return a PDF file directly. However, now a HTML file is returned instead (which then points to the actual PDF file).
For someone cloning the PDF.js repo, and attempting to set-up testing for the first time, this means that all linked test-cases will now fail. Furthermore, it also means that we cannot use the Internet Archive when adding new test-cases.

Since the HTML file returned does contain a direct link to the PDF file, embedded in an <iframe> tag, we could perhaps add special-casing for Internet Archive URLs in test/downloadutils.js, such that the HTML file is first downloaded and parsed to obtain a direct PDF link.

@timvandermeij
Copy link
Contributor

timvandermeij commented Sep 17, 2017

It looks like we can just update all links to add if_ after the ID in the URL. For example, this is a current linked test case (issue1127.pdf.link):

https://web.archive.org/web/20160114105739/https://vmp.ethz.ch/pdfs/diplome/vordiplome/Block%201/Algorithmen_%26_Komplexitaet/AlgoKo_f08_Aufg.pdf

This can be changed to:

https://web.archive.org/web/20160114105739if_/https://vmp.ethz.ch/pdfs/diplome/vordiplome/Block%201/Algorithmen_%26_Komplexitaet/AlgoKo_f08_Aufg.pdf

to get the direct link.

I have tested this with a number of linked test cases and it worked for all of them.

@Snuffleupagus
Copy link
Collaborator Author

It looks like we can just update all links to add if_ after the ID in the URL. For example, this is a current linked test case (issue1127.pdf.link):

I've noticed that as well, but I'm not sure if we really want to assume that the format will be constant in the future (it could potentially even depend on e.g. load balancing or similar things). I'd hate for us to do search-and-replace on all *.pdf.link files now, just to have to redo it in a couple of months time.

@timvandermeij
Copy link
Contributor

Correct, but doesn't the same apply for HTML parsing? If the markup changes, we may get the same problem. Personally I'm fine with either solution by the way, because I'm really on the fence about which one is faster/more future-proof.

@timvandermeij
Copy link
Contributor

timvandermeij commented Sep 17, 2017

There is also fallback code at https://github.com/mozilla/pdf.js/blob/master/test/downloadutils.js#L40 which I assume does not work anymore as well. I think we should remove that because silently handling errors does not sound right in the context of testing. I'd rather have things fail loudly so we know there is an error and can fix it properly.

Since we have Internet Archive-specific code there anyway, I think I'm fine with handling this over there. Let's try to consolidate the code in one block as much as possible.

@Snuffleupagus
Copy link
Collaborator Author

I'd rather have things fail loudly so we know there is an error and can fix it properly.

Just FYI: I attempted to implement something along those lines in PR #7947, but the resulting discussion indicated that failing hard wasn't really desirable behaviour here.

@timvandermeij
Copy link
Contributor

timvandermeij commented Sep 17, 2017

I have one more idea for this. It's a bit of a hybrid approach for the two solutions. How about in test/downloadutils.js we detect that we are dealing with an Internet Archive URL and perform the if_ transformation there? That way we don't have to touch the link files (search/replace) and can easily adjust the code if the Internet Archive were to change its format again (or implement HTML parsing there later on if it happens often)? It will be quick and keep the option for HTML parsing open (while we avoid it for now).

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented Sep 26, 2017

@timvandermeij #8920 (comment) seems like a good compromise for now! Time permitting, do you want to take a stab at implementing it (since the current state isn't great when setting up a dev-environment)?

@timvandermeij
Copy link
Contributor

timvandermeij commented Sep 26, 2017

Yes, I'm hoping I can take a look at this before or during the weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants