WIP: Start hosts file with reg-ex-ed domains from #11
predatoryjournals.com/journals & predatoryjournals.com/publishers
Hi, great idea! I'm interested in flagging URLs/URIs from predatory journals on Cobaltmetrics.com. I was about to generate a list of hosts too, so it's a good thing I checked the PRs. I reviewed your list, and I have one suggestion regarding false positives. Some hosts serve multiple journals, and I think there are cases where we don't want to block all URLs from a given host (does one bad apple spoil the whole bunch?). For example, […]. What about focusing first on URLs with empty paths? Maybe also add the constraint that a URL must have no query component to be included in the list? That would avoid false positives when an acceptable host serves multiple journals and the name of the journal is given in the query string, e.g. […].
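To make the suggestion concrete, here is a minimal sketch of such a filter, assuming the input is a plain list of journal/publisher URLs. The function names and the example URLs are hypothetical and are not taken from the actual lists.

```python
from urllib.parse import urlsplit


def is_whole_site_url(url: str) -> bool:
    """True if the URL has a host, an empty (or root) path, and no query string."""
    parts = urlsplit(url)
    return bool(parts.hostname) and parts.path in ("", "/") and not parts.query


def hosts_safe_to_block(urls):
    """Return the sorted, de-duplicated hosts of URLs that point at a site root."""
    return sorted({urlsplit(u).hostname for u in urls if is_whole_site_url(u)})


# Hypothetical example URLs (not from the real lists).
example_urls = [
    "http://journal-a.example/",                      # root path: keep
    "http://publisher-b.example/journals/xyz",        # non-empty path: skip
    "http://portal-c.example/index.php?journal=xyz",  # query string: skip
    "https://journal-d.example",                      # empty path: keep
]

print(hosts_safe_to_block(example_urls))
```

Note that this only decides which URLs are unambiguous; the skipped ones would still need a manual decision.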
Good point! Using domains for blocking is rather coarse, and probably too broad. Since there has been no reaction from @stoppredatoryjournals, I guess this can be closed as out of scope. Maybe a better approach would be to PR a conversion pipeline from […]. Could such a pipeline then handle the don't-block-after-all cases you mention?
I don't know much about the internals of adblockers, e.g. whether the biggest adblockers use the same syntax for their filters, but I eyeballed a few lists from https://filterlists.com/, and including paths (rather than just domains and hosts) seems possible. In any case, yes, if the list you want to build is derived from the "main" lists, I think the code would be more valuable than the result.
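As a rough illustration of what path-aware rules could look like, the sketch below turns a URL into an Adblock-Plus-style filter (`||` anchors the rule at the domain, `^` is a separator placeholder). This assumes the target blocker understands that syntax; support differs between blockers, and the function name and example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def to_adblock_filter(url: str) -> str:
    """Build an Adblock-Plus-style rule that keeps the path when there is one."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    # Drop a trailing slash so "/journals/xyz/" and "/journals/xyz" give the same rule.
    path = parts.path.rstrip("/")
    # No path: block the whole host. Otherwise block only that path prefix.
    return f"||{host}{path}^"


# Hypothetical example URLs (not from the real lists).
print(to_adblock_filter("http://journal-a.example/"))
print(to_adblock_filter("http://publisher-b.example/journals/xyz/"))
```

The query string is deliberately ignored here; journals identified only by a query parameter would need a different rule shape.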
Oh, and I extracted the set of paths from all URLs in […]. The list isn't very long, but there is more variation than I expected. We can't know for sure whether we can block a whole domain without some kind of manual validation.
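For reference, a path extraction along these lines could be sketched as follows. This is not the script I actually ran, just an assumed reconstruction: it groups paths by host and flags every host that has anything other than a root path for manual review. The names and example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Map each host to the set of distinct paths seen for it ("" is normalised to "/")."""
    grouped = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname:
            grouped[parts.hostname].add(parts.path or "/")
    return dict(grouped)


def needs_manual_review(urls):
    """Hosts with at least one non-root path cannot be blocked wholesale without a check."""
    return sorted(h for h, paths in paths_by_host(urls).items() if paths != {"/"})


# Hypothetical example URLs (not from the real lists).
example_urls = [
    "http://journal-a.example/",
    "http://publisher-b.example/journals/xyz",
    "http://publisher-b.example/journals/abc",
    "https://journal-d.example",
]

print(paths_by_host(example_urls))
print(needs_manual_review(example_urls))
```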
Hi! Following the suggestions in #9 and StevenBlack/hosts#720, I wanted to test whether this works for subscribing to such a file in adblockers. Yes, it does:
So, this PR suggests adding a hosts file to /_data, which I manually merged from domains I extracted from the journals & publishers lists. It probably still contains a few false positives, so please don't merge until we have discussed how to auto-generate such a file, if it is desired.

For now, it can be tested by adding this URL
https://raw.githubusercontent.com/stop-predatory-journals/stop-predatory-journals.github.io/0e64ce25fa147df7d0a79660ec91526c6436eb88/_data/hosts
to adblockers like uBlock. In case some do not support this hosts file format, we can find a more widely supported format and update this PR.
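As a starting point for the auto-generation discussion, here is a minimal sketch of rendering a list of domains in the conventional hosts-file layout (one `0.0.0.0 domain` entry per line). The header comment, the function name, and the example domains are assumptions for illustration; the file in this PR was merged by hand and may differ in detail.

```python
def render_hosts_file(domains, sink_ip="0.0.0.0"):
    """Render de-duplicated, sorted domains as hosts-file lines pointing at sink_ip."""
    cleaned = sorted({d.strip().lower() for d in domains if d.strip()})
    lines = ["# Hypothetical header: generated from the journals & publishers lists"]
    lines.extend(f"{sink_ip} {domain}" for domain in cleaned)
    return "\n".join(lines) + "\n"


# Hypothetical example domains (not from the real lists).
print(render_hosts_file(["Journal-A.example", "journal-d.example", "journal-a.example", " "]))
```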