-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect infinite baseUrl explosions #42
Comments
This was encountered a second time with an endlessly repeating / One potential solution is to check the referer's path, then the current URL's path. If the difference between them repeats 1-n times, consider it a dead-end looping URL. That won't help prevent the "first-tier" explosion of bad URLs, but can help us avoid following them infinitely-deep. |
Simple checking of repeated URL elements is now in place and will ship with the 0.9.19 release; it catches the |
Fun twist:
Results in one of the links including the following URL:
And visiting it results in one of the links including:
Right now our recursion detector assumes that the duplication will only occur at the end of the URL; this needs to be reassessed. |
In some situations where dynamic pages are generated based on the incoming URL, and links can be entered without a proper protocol, it's easy for some CMSs to generate infinite exploding URL trees. For example:
We need to figure out if there's a good way to detect these scenarios; even a brute force check is probably preferable to a hung crawl or (worse) a dataset that's trashed and has to be re-crawled.
The text was updated successfully, but these errors were encountered: