
Respecting robots.txt files #1496

Closed
TechnologyClassroom opened this issue Aug 8, 2024 · 2 comments
TechnologyClassroom commented Aug 8, 2024

I looked and I did not see anything about robots.txt files in the issues.

I see web traffic on one of the servers I manage from a client claiming to be a selfoss instance that is requesting /wiki/Special: pages. Our robots.txt file explicitly disallows robots from fetching those pages:

Disallow: /wiki/Special:
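For reference, the effect of that rule can be checked with Python's standard-library `urllib.robotparser` (the robots.txt content below is a minimal sketch based on the single rule quoted above, not the site's full file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mirroring the quoted rule.
robots_txt = [
    "User-agent: *",
    "Disallow: /wiki/Special:",
]

parser = RobotFileParser()
parser.parse(robots_txt)

# A compliant crawler would skip Special: pages but may fetch ordinary articles.
print(parser.can_fetch("Selfoss/2.19", "https://directory.fsf.org/wiki/Special:Ask/format=rss"))  # False
print(parser.can_fetch("Selfoss/2.19", "https://directory.fsf.org/wiki/Article"))  # True
```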

Is this an issue with selfoss or is this not a selfoss instance?

I would be happy to supply some redacted logs over email if it would help.

Edit: Replacing scraping with requesting.

jtojnar (Member) commented Aug 8, 2024

Hi, as a feed reader, selfoss does not crawl the web – it only periodically fetches the URLs of the feeds that the user provides. As such, I would say following robots.txt makes only slightly more sense for it than it would for a read-it-later app or a web browser.

So if selfoss is hitting a page, it most likely means a user configured it to do so.

There is also a chance that the user specified your homepage as the source URL and, since it is not a feed, the SimplePie library's smart feed discovery picked a Special: link from the page for some reason.
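Feed discovery like this conventionally works by scanning the page's `<head>` for `<link rel="alternate">` tags pointing at RSS/Atom resources (the standard feed autodiscovery mechanism). A minimal sketch of how such discovery can land on a Special: URL – the page markup and URLs here are hypothetical, not taken from the actual site:

```python
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collects <link rel="alternate"> feed URLs, as feed autodiscovery does."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        if attr.get("rel") == "alternate" and attr.get("type") in self.FEED_TYPES:
            self.feeds.append(attr.get("href"))

# Hypothetical wiki homepage advertising a feed served from a Special: page.
page = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml"
      href="/wiki/Special:RecentChanges?feed=rss">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)  # ['/wiki/Special:RecentChanges?feed=rss']
```

So if a user points a reader at the homepage rather than at a feed URL, the discovered feed link is what ends up being polled.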

Feel free to send me the logs to [email protected], I can take a look.

TechnologyClassroom (Author) commented

Ah, I see the pattern. It is one person repeating these two requests at 30-minute or hourly intervals.

```
directory.fsf.org:80 REDACTED - - [08/Aug/2024:19:30:01 -0400] "GET /wiki/Special:Ask/-5B-5BLast-20review-20date::%2B-5D-5D/format%3Drss/sort%3DLast-20review-20date/order%3Ddescending/searchlabel%3DRecent-20updates-20RSS-20feed/title%3DFree-20Software-20Directory/description%3DRecent-20updates-20to-20Free-20Software-20Directory-20(directory.fsf.org)/offset%3D0 HTTP/1.1" 301 1237 "http://directory.fsf.org/wiki/Special:Ask/-5B-5BLast-20review-20date::%2B-5D-5D/format%3Drss/sort%3DLast-20review-20date/order%3Ddescending/searchlabel%3DRecent-20updates-20RSS-20feed/title%3DFree-20Software-20Directory/description%3DRecent-20updates-20to-20Free-20Software-20Directory-20(directory.fsf.org)/offset%3D0" "Selfoss/2.19 (+https://selfoss.aditu.de)"
directory.fsf.org:80 REDACTED - - [08/Aug/2024:19:30:08 -0400] "GET /wiki/Special:Ask/-5B-5BSubmitted-20date::+-5D-5D/format=rss/sort=Submitted-20date/order=descending/searchlabel=New-20packages-20RSS-20feed/title=Free-20Software-20Directory/description=Recent-20updates-20to-20Free-20Software-20Directory-20%28directory.fsf.org%29 HTTP/1.1" 301 1199 "http://directory.fsf.org/wiki/Special:Ask/-5B-5BSubmitted-20date::+-5D-5D/format=rss/sort=Submitted-20date/order=descending/searchlabel=New-20packages-20RSS-20feed/title=Free-20Software-20Directory/description=Recent-20updates-20to-20Free-20Software-20Directory-20%28directory.fsf.org%29" "Selfoss/2.19 (+https://selfoss.aditu.de)"
```

I mistook it for crawling. That's fine at that scale. If your program gets really popular, I'll come back to request a robots.txt file feature.

Thanks for getting back to me! Closing the issue.
