
Detect infinite baseUrl explosions #42

Open
eaton opened this issue Jan 10, 2023 · 3 comments
Comments

eaton commented Jan 10, 2023

In situations where dynamic pages are generated based on the incoming URL and links can be entered without a proper protocol, some CMSs can easily generate infinitely exploding URL trees. For example:

  1. http://example.com/event/{event-id} is used as a URL route.
  2. Editors may create arbitrary sub-pages for the event, so any number of sub-URLs are theoretically valid.
  3. Event ID parsing is permissive, discarding anything after the ID but not redirecting to the shorter, valid URL. For example, '1234' and '1234/foo' both render the page for '1234', but no redirect is issued.
  4. If a malformed link with no protocol is added to that page, browsers treat the current page's URL as the base. For example, 'www.cnn.com' becomes 'http://example.com/event/1234/www.cnn.com'.
  5. Following the malformed link, with the permissive ID parsing in effect, generates a new page that stacks the URL: 'http://example.com/event/1234/www.cnn.com/www.cnn.com'.
  6. When multiple protocol-less links are present on a page, the crawl explodes geometrically, quickly skewing page counts and bloating the dataset.

We need to figure out whether there's a good way to detect these scenarios; even a brute-force check is preferable to a hung crawl or (worse) a trashed dataset that has to be re-crawled.
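The resolution behavior in steps 4 and 5 can be reproduced with the WHATWG URL class, which follows the same rules browsers do. This is a sketch using the example URLs from the scenario above; the trailing slash on the base URL reflects the permissive routing that treats '1234/anything' as a valid page:

```typescript
// A scheme-less href like 'www.cnn.com' is treated as a relative path and
// resolved against the current page's URL, exactly as browsers do.
function resolveHref(pageUrl: string, href: string): string {
  return new URL(href, pageUrl).href;
}

// Step 4: the malformed link resolves under the event page.
const first = resolveHref('http://example.com/event/1234/', 'www.cnn.com');
// → 'http://example.com/event/1234/www.cnn.com'

// Step 5: the permissively-parsed page emits the same malformed link,
// stacking another copy of the segment onto the path.
const second = resolveHref(first + '/', 'www.cnn.com');
// → 'http://example.com/event/1234/www.cnn.com/www.cnn.com'
```

Each round trip adds one segment per malformed link on the page, which is why multiple protocol-less links multiply the tree rather than merely deepening it.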


eaton commented Mar 30, 2023

This was encountered a second time with an endlessly repeating ///////~ URL. There are some expensive generalized solutions and some error-prone heuristics; we may have to choose between them.

One potential solution is to check the referer's path against the current URL's path. If the difference between them repeats 1-n times, consider it a dead-end looping URL. That won't prevent the "first-tier" explosion of bad URLs, but it can keep us from following them infinitely deep.
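One possible shape of that heuristic (a hypothetical helper, not the project's actual code): treat the current path as looping when it extends the referer's path, and the added segments are just the referer's trailing segments repeated some whole number of times.

```typescript
// Sketch of the referer-diff heuristic. Flags 'current' as a dead-end loop
// when its path extends 'referer' and the new segments merely replay the
// referer's trailing segments 1..n times.
function isLoopingUrl(referer: string, current: string): boolean {
  const ref = new URL(referer).pathname.split('/').filter(Boolean);
  const cur = new URL(current).pathname.split('/').filter(Boolean);
  if (cur.length <= ref.length) return false;
  // The current path must be a pure extension of the referer's path.
  if (ref.some((seg, i) => cur[i] !== seg)) return false;
  const added = cur.slice(ref.length);
  // Try every repeat-unit length that evenly divides the added segments.
  for (let unit = 1; unit <= added.length; unit++) {
    if (added.length % unit !== 0) continue;
    const pattern = ref.slice(ref.length - unit);
    if (pattern.length < unit) continue; // referer is shorter than the unit
    let repeats = true;
    for (let i = 0; i < added.length; i++) {
      if (added[i] !== pattern[i % unit]) { repeats = false; break; }
    }
    if (repeats) return true;
  }
  return false;
}
```

With the CNN example, a referer of 'http://example.com/event/1234/www.cnn.com' and a current URL that appends another 'www.cnn.com' is flagged, while an ordinary child page like '/event/1234/details' is not.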


eaton commented Apr 8, 2023

Simple checking of repeated URL elements is now in place and will ship with the 0.9.19 release. It catches https://example.com/~/~/~ style URLs, but isn't yet smart enough to identify the "missing protocol resulted in a full URL being appended to the baseURL" problem from the very first example that started this issue.
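A minimal sketch of a repeated-trailing-segment check of that kind (illustrative only, not the shipped implementation) just counts how many times the final path segment repeats at the end of the path:

```typescript
// Flag URLs whose last path segment occurs `threshold` or more times in a
// row at the end of the path, e.g. 'https://example.com/~/~/~'.
// Note: empty segments from runs of slashes ('///////') are dropped by the
// filter here; a fuller check might count those runs as well.
function hasTrailingRepeats(url: string, threshold = 3): boolean {
  const segs = new URL(url).pathname.split('/').filter(Boolean);
  let run = 1;
  for (let i = segs.length - 1; i > 0 && segs[i] === segs[i - 1]; i--) run++;
  return run >= threshold;
}
```

This catches tail-repetition but, as noted, says nothing about a foreign hostname being appended under the baseURL, since 'www.cnn.com' appearing once at the end of a path is indistinguishable from a legitimate segment.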


eaton commented May 20, 2023

Fun twist: https://blog.education.nationalgeographic.org/2013/09/24/jerusalem-the-movie-explores-tolerance-themes shows off a special hellscape that generates infinitely expanding paths with an extra suffix. That is, visiting:

/2013/09/24/jerusalem-the-movie-explores-tolerance-themes

Results in one of the links including the following URL:

/2013/09/24/jerusalem-the-movie-explores-tolerance-themes/NatGeoEd.org/Jerusalem

And visiting it results in one of the links including:

/2013/09/24/jerusalem-the-movie-explores-tolerance-themes/NatGeoEd.org/NatGeoEd.org/Jerusalem

Right now our recursion detector assumes the duplication will only occur at the end of the URL; that assumption needs to be reassessed.
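Relaxing that assumption could be as simple as scanning the whole path for adjacent runs of identical segments rather than only checking the tail. A sketch (not the project's actual detector) that catches the 'NatGeoEd.org/NatGeoEd.org/Jerusalem' case:

```typescript
// Scan every position in the path for an adjacent run of identical
// segments, so mid-path repetition with a trailing suffix is still caught.
function hasRepeatedSegments(url: string, threshold = 2): boolean {
  const segs = new URL(url).pathname.split('/').filter(Boolean);
  let run = 1;
  for (let i = 1; i < segs.length; i++) {
    run = segs[i] === segs[i - 1] ? run + 1 : 1;
    if (run >= threshold) return true;
  }
  return false;
}
```

A threshold of 2 flags the NatGeo URLs after a single extra hop, at the cost of false positives on legitimate URLs with a doubled segment; adjacent runs also won't catch multi-segment patterns repeating non-adjacently, which a fuller fix would need to handle.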
