only version 1.8.0 robots.txt forbidden #4145

Closed
HisakaKoji opened this issue Nov 12, 2019 · 2 comments

@HisakaKoji

Description

When I crawl the URL "https://www.walkerplus.com/" with Scrapy,
it works in version 1.7, but the request is forbidden by robots.txt in version 1.8.

The robots.txt
`user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:`

Steps to Reproduce

  1. Crawl "https://www.walkerplus.com/" with Scrapy 1.8.

Expected behavior:

The request is not forbidden by robots.txt.
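
As a rough cross-check of the expected behavior, the robots.txt quoted above can be fed to Python's standard-library `urllib.robotparser`. This is only a sketch: the helper name `is_allowed` is made up for illustration, and the standard-library parser is not the parser Scrapy 1.8 uses by default, so it only shows how one other implementation reads these rules.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content copied from the issue description.
ROBOTS_TXT = """\
user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Hypothetical helper: ask the stdlib parser whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "https://www.walkerplus.com/"))
```

The `disallow` values here are full URLs rather than paths, so under this parser none of them is a prefix of the path `/`.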

Actual behavior:

The request is forbidden by robots.txt.

Reproduces how often:

100% of the time.

Versions

Scrapy : 1.8.0
lxml : 4.4.1.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.6.8 (default, Oct 7 2019, 12:59:55) - [GCC 8.3.0]
pyOpenSSL : 19.0.0 (OpenSSL 1.1.1 11 Sep 2018)
cryptography : 2.1.4
Platform : Linux-3.10.0-1062.4.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

@Gallaecio
Member

It looks like a bug in Protego, our new default robots.txt parser.

In the meantime, you can switch to a different robots.txt parser.
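
For example, a minimal sketch of the workaround in a project's `settings.py`, assuming Scrapy 1.8, where the `ROBOTSTXT_PARSER` setting selects the parser backend. The value below picks the backend built on Python's standard-library parser; any other supported backend could be used instead.

```python
# settings.py (sketch): keep obeying robots.txt, but parse it with a
# different backend than the default Protego-based one.
ROBOTSTXT_OBEY = True
ROBOTSTXT_PARSER = "scrapy.robotstxt.PythonRobotParser"
```

Removing the `ROBOTSTXT_PARSER` line restores the default parser once a fixed Protego release is installed.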

@Gallaecio
Member

Protego 0.1.16 fixes this issue.
