
WIP: Start hosts file with reg-ex-ed domains from #11

Open · wants to merge 3 commits into master
Conversation

@katrinleinweber commented Aug 7, 2018

Hi! Following the suggestions in #9 and StevenBlack/hosts#720, I wanted to test whether this works for subscribing to such a file in adblockers. Yes, it does:

[screenshot: the hosts file successfully subscribed in an adblocker]

So, this PR suggests adding a hosts file to /_data, which I manually merged from domains extracted from the journals & publishers lists. It probably still contains a few false positives, so please don't merge until we've discussed how to auto-generate such a file, if it is desired.

For now, though, it can be tested by adding this URL https://raw.githubusercontent.com/stop-predatory-journals/stop-predatory-journals.github.io/0e64ce25fa147df7d0a79660ec91526c6436eb88/_data/hosts to adblockers like uBlock. In case some do not support this hosts file format, we can find a more widely supported one and update this PR.
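
For reference, the hosts format simply maps each blocked domain to an unroutable address, one per line (the domains below are hypothetical):

```
0.0.0.0 badjournal.example
0.0.0.0 predatorypublisher.example
```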

@lucboruta

Hi, great idea!

I'm interested in flagging URLs/URIs from predatory journals on Cobaltmetrics.com. I was about to generate a list of hosts too, good thing I checked the PRs.

I reviewed your list, and I have one suggestion regarding false positives. Some hosts host multiple journals, and I think there are cases where we don't want to block all URLs from a given host (does one bad apple spoil the whole bunch?).

For example, journals.csv includes http://journals.sfu.ca/africanem/index.php/ajtcam/index (cf. line 20), but I don't think we want to include journals.sfu.ca in the list of hosts (cf. line 1581).

What about focusing first on URLs whose path is empty or just / (HTTP defines an empty path to be equivalent to / anyway), plus a few obvious root-like paths, e.g. anything that matches ^/(default|index|home)\.(aspx|html?|php)$ case-insensitively?

Maybe also require that a URL have no query component to be included in the list? That would avoid false positives when an acceptable host serves multiple journals and the journal's name is given in the query string, e.g. https://goodhost/index.php?journal=badjournal. Something like the sketch below.
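
A minimal sketch of that heuristic, assuming Python and only the standard library (the function name and the .example domains are illustrative):

```python
import re
from urllib.parse import urlsplit

# Root-like paths per the proposal above, matched case-insensitively
ROOT_LIKE = re.compile(r"^/(default|index|home)\.(aspx|html?|php)$", re.IGNORECASE)

def host_blockable(url: str) -> bool:
    """True if it looks safe to block the URL's whole host."""
    parts = urlsplit(url)
    if parts.query:            # journal may be named in the query string
        return False
    path = parts.path or "/"   # HTTP treats an empty path as "/"
    return path == "/" or bool(ROOT_LIKE.match(path))

# host_blockable("http://badjournal.example/")                               -> True
# host_blockable("http://badjournal.example/index.php")                      -> True
# host_blockable("http://journals.sfu.ca/africanem/index.php/ajtcam/index")  -> False
# host_blockable("https://goodhost/index.php?journal=badjournal")            -> False
```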

@katrinleinweber (Author)

Good point! Using domains for blocking is rather coarse, and probably too broad. Since there has been no reaction from @stoppredatoryjournals, I guess this can be closed as out of scope.

Maybe a better approach would be to PR a conversion pipeline from _data/*.csv to an adblocker-compatible file format.

Could such a pipeline then handle the don't-block-after-all features you mention?
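
A rough sketch of such a pipeline (Python; assuming each CSV has a url column, which is a guess, and emitting the hosts format tested above). The don't-block-after-all rules could slot in before the hosts.add call:

```python
import csv
import glob
from urllib.parse import urlsplit

hosts = set()
for csv_path in glob.glob("_data/*.csv"):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get("url") or "").strip()  # "url" column name is a guess
            if url:
                hosts.add(urlsplit(url).hostname)
hosts.discard(None)  # drop malformed URLs that yield no hostname

# One "0.0.0.0 <domain>" line per host, the usual hosts-file form
with open("_data/hosts", "w", encoding="utf-8") as out:
    for host in sorted(hosts):
        out.write(f"0.0.0.0 {host}\n")
```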

@lucboruta

I don't know much about the internals of adblockers, e.g. whether the biggest ones share a common filter syntax, but I eyeballed a few lists from https://filterlists.com/, and including paths (rather than just domains and hosts) seems possible.
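
For instance, in the filter syntax uBlock inherits from Adblock Plus, both host-wide and path-level rules appear to be possible (badjournal.example is hypothetical; the sfu.ca path is the one discussed above):

```
! block an entire host
||badjournal.example^
! block only one journal's path on a shared host
||journals.sfu.ca/africanem/
```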

In any case, yes, if the list you want to build is derived from the "main" lists, I think the code would be more valuable than the result.

@lucboruta

Oh, and I extracted the set of paths from all URLs in _data/*.csv, lowercased everything and filtered out paths that contain acronyms or what looked like site- or journal-specific information. Here's the Gist: https://gist.github.com/lucboruta/0ea6ab3adac42f8eba6237ee9847c308

The list isn't very long, but there is more variation than I expected. We can't know for sure whether we can block the whole domain without some kind of manual validation.
