Respecting robots.txt files #1496
Hi, as a feed reader, selfoss does not crawl the web – it only periodically fetches the feed URLs that the user provides. As such, I would say robots.txt does not really apply here. So if selfoss is hitting a page, it most likely means a user configured it to do so. There is also a chance that the user specified your homepage as the source URL and, since it is not a feed, the SimplePie library's smart feed discovery picks a special link from the page for some reason. Feel free to send me the logs to [email protected], I can take a look.
Ah, I see the pattern. It is one person making these two requests at 30-minute or hourly intervals, repeating.
I mistook it for crawling. That's fine at that scale. If your program gets really popular, I'll come back to request a robots.txt feature. Thanks for getting back to me! Closing the issue.
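Should such a feature ever be requested again, a minimal sketch of what a robots.txt check could look like, using Python's standard-library `urllib.robotparser` purely for illustration (selfoss itself is PHP, and the `is_fetch_allowed` helper here is hypothetical, not part of any real codebase):

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the raw lines of an already-downloaded robots.txt file
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rules like the ones described in this issue:
rules = "User-agent: *\nDisallow: /wiki/Special:\n"
print(is_fetch_allowed(rules, "selfoss", "https://example.org/wiki/Special:Export"))  # False
print(is_fetch_allowed(rules, "selfoss", "https://example.org/feed.xml"))             # True
```

A feed reader that wanted to honor robots.txt would fetch the file once per host, cache it, and run a check like this before each feed request.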
I looked and I did not see anything about robots.txt files in the issues.
I see web traffic on one of the servers I manage claiming to be a selfoss instance which is ~~scraping~~ requesting /wiki/Special: pages. Our robots.txt file explicitly disallows robots from scraping those pages. Is this an issue with selfoss or is this not a selfoss instance?
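For illustration, a robots.txt stanza that disallows MediaWiki Special: pages typically looks something like this (the exact rules on the server in question are an assumption):

```
User-agent: *
Disallow: /wiki/Special:
```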
I would be happy to supply some redacted logs over email if it would help.
Edit: Replacing scraping with requesting.