only version 1.8.0 robots.txt forbidden #4145

HisakaKoji · 2019-11-12T01:11:25Z

Description

When I crawled the URL "https://www.walkerplus.com/" with Scrapy,
it worked in version 1.7, but in version 1.8 the request is forbidden by robots.txt.

The site's robots.txt:
`user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:`
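
As a reference point, the rules above can be fed to Python's standard-library `urllib.robotparser`. This is only a sketch: it uses the stdlib parser, not Protego, and the helper name `is_allowed` is hypothetical. It illustrates why the root URL is expected to be fetchable: every `disallow` value is a full URL on another host, so none of them is a path prefix of `/`.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in this report.
ROBOTS_TXT = """\
user-agent: *
disallow: http://ms-web00.walkerplus.com/
disallow: http://www-origin.walkerplus.com/
disallow: http://walkerplus.jp/
disallow: http://walkerplus.net/
disallow: https://ms.walkerplus.com/

user-agent: twitterbot
disallow:
"""


def is_allowed(url, user_agent="*"):
    """Return True if the stdlib parser lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("https://www.walkerplus.com/"))
```

Different parsers may treat full-URL `disallow` values differently, so this shows one parser's reading of the file rather than a normative result.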

Steps to Reproduce

Crawl "https://www.walkerplus.com/" with Scrapy 1.8.0.

Expected behavior:

The request to "https://www.walkerplus.com/" is not forbidden by robots.txt.

Actual behavior:

The request to "https://www.walkerplus.com/" is forbidden by robots.txt.

Reproduces how often:

It reproduces 100% of the time.

Versions

Scrapy : 1.8.0
lxml : 4.4.1.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.6.8 (default, Oct 7 2019, 12:59:55) - [GCC 8.3.0]
pyOpenSSL : 19.0.0 (OpenSSL 1.1.1 11 Sep 2018)
cryptography : 2.1.4
Platform : Linux-3.10.0-1062.4.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic

Gallaecio · 2019-11-12T06:06:32Z

It looks like a bug in Protego, our new default robots.txt parser.

In the meantime, you can switch to a different robots.txt parser.
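
A minimal sketch of that workaround, assuming a standard project `settings.py`. Scrapy 1.8 selects the parser through the `ROBOTSTXT_PARSER` setting; the value below picks the parser backed by Python's standard library, which needs no extra dependency. Whether it handles every robots.txt the way you need is something to verify for your own crawl.

```python
# settings.py (fragment)

# Keep obeying robots.txt, but do not use the default Protego-based parser.
ROBOTSTXT_OBEY = True

# Parser backed by Python's urllib.robotparser.
ROBOTSTXT_PARSER = "scrapy.robotstxt.PythonRobotParser"
```

The same setting can be reverted to its default once a fixed Protego release is installed.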

Gallaecio · 2019-12-09T12:32:11Z

Protego 0.1.16 fixes this issue.
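
To check whether an environment already has the fixed release, something like the following could be used. It is a sketch with hypothetical helper names (`parse_version`, `protego_has_fix`), it assumes plain `X.Y.Z` version strings, and it relies on `importlib.metadata`, which requires Python 3.8 or later (newer than the Python 3.6.8 in the report).

```python
from importlib.metadata import PackageNotFoundError, version

# First Protego release reported in this thread as containing the fix.
FIXED_VERSION = (0, 1, 16)


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split(".")[:3])


def protego_has_fix(installed=None):
    """Return True if the given (or installed) Protego version is >= 0.1.16."""
    if installed is None:
        try:
            installed = version("protego")
        except PackageNotFoundError:
            # Protego is not installed at all.
            return False
    return parse_version(installed) >= FIXED_VERSION


if __name__ == "__main__":
    print(protego_has_fix())
```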

Gallaecio added the "bug" and "upstream issue" labels Nov 12, 2019

Gallaecio mentioned this issue Nov 12, 2019

Cannot fetch non-disallowed domain scrapy/protego#4

Closed

whalebot-helmsman mentioned this issue Dec 5, 2019

Enforce path only in Alllow/Disallow scrapy/protego#8

Merged

Gallaecio closed this as completed Dec 9, 2019

whalebot-helmsman mentioned this issue Dec 18, 2019

Use protego version with fix #4245

Closed
