All Internet Archive links, used in reference testing, are broken #8920

Snuffleupagus · 2017-09-17T11:56:08Z

Apparently the Internet Archive, which we depend on for a very large number of (linked) reference test-cases, has recently changed how they serve PDF files.

Previously, a URL such as http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About%20Stacks.pdf would return a PDF file directly. However, an HTML file is now returned instead (which then points to the actual PDF file).
For someone cloning the PDF.js repo and attempting to set up testing for the first time, this means that all linked test-cases will now fail. Furthermore, it also means that we cannot use the Internet Archive when adding new test-cases.

Since the HTML file returned does contain a direct link to the PDF file, embedded in an <iframe> tag, we could perhaps add special-casing for Internet Archive URLs in test/downloadutils.js, such that the HTML file is first downloaded and parsed to obtain a direct PDF link.
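For illustration, the parsing step could look roughly like the sketch below. The helper name `extractIframeSrc` and the sample markup in the usage note are assumptions made up for this example; the actual HTML served by the Internet Archive may differ, and nothing here is taken from test/downloadutils.js.

```javascript
// Hypothetical helper, not part of PDF.js: return the `src` of the first
// <iframe> found in an HTML string, or null if there is none.
// A regular expression is used to keep the sketch dependency-free; it is
// not a general-purpose HTML parser.
function extractIframeSrc(html) {
  const match = /<iframe\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/i.exec(html);
  return match ? match[1] : null;
}
```

Called on a page such as `<iframe id="playback" src="https://example.org/a.pdf"></iframe>` (made-up markup), this returns the `src` value; the download code would then fetch that URL instead of the original one.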

timvandermeij · 2017-09-17T13:15:13Z

It looks like we can just update all links to add if_ after the ID in the URL. For example, this is a current linked test case (issue1127.pdf.link):

https://web.archive.org/web/20160114105739/https://vmp.ethz.ch/pdfs/diplome/vordiplome/Block%201/Algorithmen_%26_Komplexitaet/AlgoKo_f08_Aufg.pdf

This can be changed to:

https://web.archive.org/web/20160114105739if_/https://vmp.ethz.ch/pdfs/diplome/vordiplome/Block%201/Algorithmen_%26_Komplexitaet/AlgoKo_f08_Aufg.pdf

to get the direct link.

I have tested this with a number of linked test cases and it worked for all of them.

Snuffleupagus · 2017-09-17T13:30:21Z

> It looks like we can just update all links to add if_ after the ID in the URL. For example, this is a current linked test case (issue1127.pdf.link):

I've noticed that as well, but I'm not sure if we really want to assume that the format will remain constant in the future (it could potentially even depend on e.g. load balancing or similar things). I'd hate for us to do search-and-replace on all *.pdf.link files now, just to have to redo it in a couple of months' time.

timvandermeij · 2017-09-17T13:34:46Z

Correct, but doesn't the same apply to HTML parsing? If the markup changes, we may get the same problem. Personally I'm fine with either solution by the way, because I'm really on the fence about which one is faster/more future-proof.

timvandermeij · 2017-09-17T13:40:19Z

There is also fallback code at https://github.com/mozilla/pdf.js/blob/master/test/downloadutils.js#L40 which I assume no longer works either. I think we should remove that because silently handling errors does not sound right in the context of testing. I'd rather have things fail loudly so we know there is an error and can fix it properly.

Since we have Internet Archive-specific code there anyway, I think I'm fine with handling this over there. Let's try to consolidate the code in one block as much as possible.

Snuffleupagus · 2017-09-17T13:48:25Z

> I'd rather have things fail loudly so we know there is an error and can fix it properly.

Just FYI: I attempted to implement something along those lines in PR #7947, but the resulting discussion indicated that failing hard wasn't really desirable behaviour here.

timvandermeij · 2017-09-17T13:59:55Z

I have one more idea for this. It's a hybrid of the two solutions. How about in test/downloadutils.js we detect that we are dealing with an Internet Archive URL and perform the if_ transformation there? That way we don't have to touch the link files (search/replace) and can easily adjust the code if the Internet Archive were to change its format again (or implement HTML parsing there later on if it happens often). It will be quick and keeps the option for HTML parsing open (while we avoid it for now).
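A minimal sketch of that transformation, assuming Internet Archive URLs always have the shape `web.archive.org/web/<digits>/<original URL>` as in the examples earlier in this thread. The function name `addIframeModifier` and the regular expression are hypothetical and are not claimed to match what was eventually merged.

```javascript
// Hypothetical helper, not the merged implementation: insert `if_` after
// the numeric ID of an Internet Archive URL so that the PDF file is served
// directly. URLs that do not match, including ones that already carry a
// modifier such as `if_`, are returned unchanged.
function addIframeModifier(url) {
  const pattern =
    /^(https?:\/\/web\.archive\.org\/web\/)(\d+)(\/https?:\/\/.+)$/;
  const match = pattern.exec(url);
  if (!match) {
    return url;
  }
  return match[1] + match[2] + "if_" + match[3];
}
```

Applied to the issue1127.pdf.link URL quoted above, this yields the `20160114105739if_` variant. Keeping the rewrite in one function means the *.pdf.link files stay untouched, and only this function needs changing if the URL format changes again.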

Snuffleupagus · 2017-09-26T19:44:57Z

@timvandermeij #8920 (comment) seems like a good compromise for now! Time permitting, do you want to take a stab at implementing it (since the current state isn't great when setting up a dev-environment)?

timvandermeij · 2017-09-26T20:59:17Z

Yes, I'm hoping I can take a look at this before or during the weekend.

Snuffleupagus added test regression labels Sep 17, 2017

timvandermeij mentioned this issue Sep 30, 2017

Transform Web Archive URLs to avoid downloading an HTML page instead of the PDF file #8979

Merged

Snuffleupagus closed this as completed in #8979 Sep 30, 2017
