Consolidate this effort with the citizenlab/test-lists effort #1

Open
hellais opened this issue Oct 5, 2017 · 8 comments

Comments

@hellais

hellais commented Oct 5, 2017

It seems like this repo is roughly the same as https://github.com/citizenlab/test-lists/, but since they are separate GitHub repositories it's hard for somebody to:

a. Tell what the difference is between the two
b. Know how they can contribute to this project

What is the reason for not joining forces with citizenlab/test-lists and just contributing to the same dataset?

If it's a matter of adding some extra columns to the data, we can work something out.

@willscott

The commits in this repo all seem to be automatically generated. That may imply there's software managing this list, rather than pure manual curation.

@hellais
Author

hellais commented Oct 6, 2017

@willscott from speaking to @jdcc, it seems like it's a bit of both. This test list actually has additional URLs that are not in the CitizenLab lists, and we should consolidate the two.

Probably what we can do is just have the integration scripts (which apparently pull in data from spreadsheets hosted elsewhere) run on the primary citizenlab test-lists repo, so the whole community benefits.

@hellais
Author

hellais commented Nov 2, 2017

@jdcc any news on this front?

@jdcc
Collaborator

jdcc commented Jan 11, 2018

Sorry I've been hard to reach on this - I don't usually see GitHub pings.

This is a separate project mostly because we want the flexibility our own set of lists affords. The fact that this repo is public is mostly a matter of convenience so our consumers don't need to deal with auth.

I think you're right that the README could lead to confusion. What do you think about a big blurb that just tells folks to contribute to the citizenlab repo?

I see these as the important differences between the repos (let me know what I get wrong):

  1. We have some URLs that have been added by our collaborators that aren't in the citizenlab repo.
  2. We're operating with different category lists.
  3. I think OONI does some simple screening of URLs as they're added to make sure they're not malicious or whatever. We don't do that.
  4. I think we do more aggressive URL normalization.

Most of these don't really matter as long as we're maintaining a separate repo, but there is real value in the additional URLs added by our collaborators that haven't been pushed upstream (point 1). You're right that those should be contributed back. Would it be enough to revise the underlying process so that spreadsheets added by our collaborators get merged into their respective CSVs and turned into PRs against the citizenlab repo? I'll try to remember that this is happening if we decide to add 100k URLs to a list or some other weirdness.
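
The step described above (finding collaborator-added URLs that are missing upstream so they can be proposed as PRs) could be sketched roughly as follows. This is only an illustration, not the actual integration scripts: it assumes both lists are CSVs with a `url` column, and the function names and sample rows are made up for the example.

```python
import csv
import io


def read_urls(csv_text):
    """Return the set of URLs found in the `url` column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["url"].strip() for row in reader if row.get("url")}


def missing_upstream(local_csv, upstream_csv):
    """Return rows of the local list whose URL is absent from the upstream list."""
    upstream_urls = read_urls(upstream_csv)
    reader = csv.DictReader(io.StringIO(local_csv))
    return [row for row in reader if row["url"].strip() not in upstream_urls]


# Hypothetical sample data; real lists would be read from files on disk.
LOCAL = "url,category_code\nhttp://example.com/,NEWS\nhttp://example.org/,HUMR\n"
UPSTREAM = "url,category_code\nhttp://example.com/,NEWS\n"

# Rows that would be candidates for a PR against the upstream repo.
candidates = missing_upstream(LOCAL, UPSTREAM)
for row in candidates:
    print(row["url"])
```

The comparison here is an exact string match after trimming whitespace, so two lists that normalize URLs differently would need a shared normalization step first.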

On point 2, did you guys document the motivation or thought process behind simplifying the schema? I think we're open to doing the same thing, but we have to talk through it internally.

On point 3, I don't anticipate we'll start reviewing URLs, so if you plan on reviewing our PRs like any other, that solves this. Otherwise, I like the idea from citizenlab/test-lists#264 of tagging our contributions as not-yet-reviewed.

Point 4 doesn't really matter as long as we're maintaining a separate repo.
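
For context on point 4, here is a minimal sketch of the kind of URL normalization a list maintainer might apply. The specific rules (lowercasing the host, dropping default ports, fragments, and userinfo, and using `/` for an empty path) are assumptions for illustration; the thread does not say which rules either repo actually uses.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target of the URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Normalize a URL under the assumed rules described in the lead-in."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # `hostname` is already lowercased and excludes any userinfo or port.
    host = parts.hostname or ""
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # An empty path and "/" refer to the same resource, so pick one form.
    path = parts.path or "/"
    # The fragment is dropped; the query string is kept unchanged.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url(" HTTP://Example.COM:80 "))
print(normalize_url("https://example.org:8443/a?b=1#frag"))
```

More aggressive rules (for example, stripping a `www.` prefix or tracking parameters) can merge URLs that behave differently, which is one reason two repos might reasonably choose different normalization levels.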

@hellais
Author

hellais commented Jan 11, 2018

Thanks for your response @jdcc.

> Would it be enough to revise the underlying process so that spreadsheets added by our collaborators get merged into their respective CSVs and turned into PRs against the citizenlab repo?

Yes, I think that would be great and super useful for us and the rest of the research community.
In terms of evaluating which changes (if not all) should be contributed back upstream, we wrote up this document explaining the rationale behind the creation of the test lists: https://ooni.torproject.org/get-involved/contribute-test-lists/.

> On point 2, did you guys document the motivation or thought process behind simplifying the schema?

I would say most of the thought process behind that change is documented in this ticket: citizenlab/test-lists#27. If you have questions about some particular aspect, @sneft, @agrabeli, or I would be happy to answer them in more detail.

> On point 3, I don't anticipate we'll start reviewing URLs, so if you plan on reviewing our PRs like any other, that solves this

I don't see a problem in this. As long as your contributions go through the standard pull request review process, they will be subject to the same review as any other URL list.

> Point 4 doesn't really matter as long as we're maintaining a separate repo

Makes sense.

@jdcc
Collaborator

jdcc commented Jan 16, 2018

> In terms of evaluating which changes (if not all) should be contributed back upstream...

We generally follow the same guidelines. Since deviations from them are a small minority of the updates, I say we just toss out the PRs that don't follow those guidelines.

Thanks for everything else. I'll get this in my dev queue.

@hellais
Author

hellais commented Feb 21, 2020

As I was doing some spring cleaning of issues, I came across citizenlab/test-lists#236.

I realise a bunch of time has passed, but I was wondering if this is still something we can look into.

The two repos may have diverged so much by now that it could be a bit hard to track the changes, but perhaps you have some thoughts on this.

@bact

bact commented Mar 20, 2020

One of the things that berkmancenter/url-lists has but citizenlab/test-lists doesn't is https://github.com/berkmancenter/url-lists/blob/master/country_groups.csv, which puts countries into regions. It looks like berkmancenter has a special focus on the MENA region: see https://github.com/berkmancenter/url-lists/tree/master/lists/geopolitical_lists

This may be related to citizenlab/test-lists#478 on cis.csv as well.

I'm not sure how OONI currently uses or plans to use that information.

--

Also, I have checked berkmancenter's Thailand list: it is more than twice the size of citizenlab's, although part of the reason is that berkmancenter's list includes lots of inactive/dead links that have already been removed from citizenlab's.
