It looks like a bug in Protego, our new default robots.txt parser.
In the meantime, you can switch to a different robots.txt parser.
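To switch parsers, Scrapy 1.8 exposes the `ROBOTSTXT_PARSER` setting. A minimal sketch of a project `settings.py` that selects the parser backed by Python's standard-library `urllib.robotparser` instead of the default Protego one:

```python
# settings.py -- work around the Protego bug by switching the
# robots.txt parser from the default (Protego) to the one backed
# by Python's stdlib urllib.robotparser, which ships with Scrapy 1.8.
ROBOTSTXT_OBEY = True
ROBOTSTXT_PARSER = 'scrapy.robotstxt.PythonRobotsTxtParser'
```

With this setting in place, the crawl of `https://www.walkerplus.com/` should no longer be blocked.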
Protego 0.1.16 fixes this issue.
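With the fix released, upgrading Protego in place should be enough (hypothetical session, assuming a pip-managed environment):

```shell
# Upgrade Protego to a release containing the fix.
pip install --upgrade "protego>=0.1.16"
```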
Description
When I crawled "https://www.walkerplus.com/" with Scrapy, the request succeeded in version 1.7, but in version 1.8 it is rejected as forbidden by robots.txt.
The site's robots.txt:

```
user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:
```
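The expected behavior can be checked against the standard-library parser: the `disallow` values above are full URLs rather than paths, so they match no path on www.walkerplus.com, and everything should be allowed. A minimal sketch (robots.txt content inlined from above):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from www.walkerplus.com, inlined.
robots_txt = """\
user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The disallow values are full URLs, not paths, so they do not match
# the site root "/" and the fetch should be allowed.
allowed = parser.can_fetch("Scrapy", "https://www.walkerplus.com/")
print(allowed)  # → True
```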
Steps to Reproduce
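A minimal way to reproduce (hypothetical session; assumes Scrapy 1.8.0 is installed and network access is available):

```shell
pip install scrapy==1.8.0
scrapy shell -s ROBOTSTXT_OBEY=True "https://www.walkerplus.com/"
# With Scrapy 1.8.0 the request is dropped and the log should show a line like:
#   Forbidden by robots.txt: <GET https://www.walkerplus.com/>
```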
Expected behavior: The request is allowed. The `disallow` lines contain full URLs rather than paths, so they should not match any path on www.walkerplus.com.

Actual behavior: The request is rejected as forbidden by robots.txt.

Reproduces how often: 100% of the time.
Versions
Scrapy : 1.8.0
lxml : 4.4.1.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.6.8 (default, Oct 7 2019, 12:59:55) - [GCC 8.3.0]
pyOpenSSL : 19.0.0 (OpenSSL 1.1.1 11 Sep 2018)
cryptography : 2.1.4
Platform : Linux-3.10.0-1062.4.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic