
WIP: Start hosts file with reg-ex-ed domains from #11

Open · wants to merge 3 commits into master
Conversation

@katrinleinweber commented Aug 7, 2018

Hi! Following the suggestions in #9 and StevenBlack/hosts#720, I wanted to test whether this works for subscribing to such a file in adblockers. Yes, it does:

[screenshot: the hosts file successfully subscribed in an adblocker]

So, this PR suggests adding a hosts file to /_data, which I manually merged from domains extracted from the journals & publishers lists. It probably still contains a few false positives, so please don't merge until we've discussed how to auto-generate such a file, if it is desired.

For now, though, it can be tested by adding this URL https://raw.githubusercontent.com/stop-predatory-journals/stop-predatory-journals.github.io/0e64ce25fa147df7d0a79660ec91526c6436eb88/_data/hosts to adblockers like uBlock. In case some do not support this hosts file format, we can find a more widely supported one and update this PR.
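
For reference, the hosts format simply maps each blocked domain to an unroutable address, one per line (the domains below are hypothetical):

```
0.0.0.0 badjournal.example
0.0.0.0 predatorypublisher.example
```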

@lucboruta

Hi, great idea!

I'm interested in flagging URLs/URIs from predatory journals on Cobaltmetrics.com. I was about to generate a list of hosts too, good thing I checked the PRs.

I reviewed your list, and I have one suggestion regarding false positives. Some hosts host multiple journals, and I think there are cases where we don't want to block all URLs from a given host (does one bad apple spoil the whole bunch?).

For example, journals.csv includes http://journals.sfu.ca/africanem/index.php/ajtcam/index (cf. line 20), but I don't think we want to include journals.sfu.ca in the list of hosts (cf. line 1581).

What about focusing first on URLs whose path is empty or just / (HTTP defines an empty path to be equivalent to / anyway), plus a few obvious root-like paths, e.g. anything that matches ^/(default|index|home)\.(aspx|html?|php)$ case-insensitively?

Maybe also require that a URL have no query component to be included in the list? That would avoid false positives when an acceptable host serves multiple journals and the journal's name is given in the query string, e.g. https://goodhost/index.php?journal=badjournal. Something like the sketch below.
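
A minimal sketch of that heuristic, assuming Python and only the standard library (the function name and the .example domains are illustrative):

```python
import re
from urllib.parse import urlsplit

# Root-like paths per the proposal above, matched case-insensitively
ROOT_LIKE = re.compile(r"^/(default|index|home)\.(aspx|html?|php)$", re.IGNORECASE)

def host_blockable(url: str) -> bool:
    """True if it looks safe to block the URL's whole host."""
    parts = urlsplit(url)
    if parts.query:            # journal may be named in the query string
        return False
    path = parts.path or "/"   # HTTP treats an empty path as "/"
    return path == "/" or bool(ROOT_LIKE.match(path))

# host_blockable("http://badjournal.example/")                               -> True
# host_blockable("http://badjournal.example/index.php")                      -> True
# host_blockable("http://journals.sfu.ca/africanem/index.php/ajtcam/index")  -> False
# host_blockable("https://goodhost/index.php?journal=badjournal")            -> False
```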

@katrinleinweber (Author)

Good point! Using domains for blocking is rather coarse, and probably too broad. Since there has been no reaction from @stoppredatoryjournals, I guess this can be closed as out of scope.

Maybe a better approach would be to PR a conversion pipeline from _data/*.csv to an adblocker-compatible file format.

Could such a pipeline then handle the don't-block-after-all features you mention?
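
A rough sketch of such a pipeline (Python; assuming each CSV has a url column, which is a guess, and emitting the hosts format tested above). The don't-block-after-all rules could slot in before the hosts.add call:

```python
import csv
import glob
from urllib.parse import urlsplit

hosts = set()
for csv_path in glob.glob("_data/*.csv"):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get("url") or "").strip()  # "url" column name is a guess
            if url:
                hosts.add(urlsplit(url).hostname)
hosts.discard(None)  # drop malformed URLs that yield no hostname

# One "0.0.0.0 <domain>" line per host, the usual hosts-file form
with open("_data/hosts", "w", encoding="utf-8") as out:
    for host in sorted(hosts):
        out.write(f"0.0.0.0 {host}\n")
```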

@lucboruta

I don't know much about the internals of adblockers, e.g. whether the biggest ones share a common filter syntax, but I eyeballed a few lists from https://filterlists.com/, and including paths (rather than just domains and hosts) seems possible.
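
For instance, in the filter syntax uBlock inherits from Adblock Plus, both host-wide and path-level rules appear to be possible (badjournal.example is hypothetical; the sfu.ca path is the one discussed above):

```
! block an entire host
||badjournal.example^
! block only one journal's path on a shared host
||journals.sfu.ca/africanem/
```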

In any case, yes, if the list you want to build is derived from the "main" lists, I think the code would be more valuable than the result.

@lucboruta

Oh, and I extracted the set of paths from all URLs in _data/*.csv, lowercased everything and filtered out paths that contain acronyms or what looked like site- or journal-specific information. Here's the Gist: https://gist.github.com/lucboruta/0ea6ab3adac42f8eba6237ee9847c308

The list isn't very long, but there is more variation than I expected. We can't know for sure whether we can block the whole domain without some kind of manual validation.
