test lists: investigate errors and redirections #1727

Closed · bassosimone opened this issue Aug 3, 2021 · 20 comments
@bassosimone

This issue covers an initial reconnaissance aimed at understanding which URLs in the test list trigger errors from uncensored locations, as well as the average number of redirects in the test list.

This activity supports the webconnectivity redesign (#1714).

bassosimone self-assigned this on Aug 3, 2021
@bassosimone

I created the bassosimone/gardener repository to collect scripts for checking every entry inside the test list. As of bassosimone/test-lists-gardener@c9892fb, the repository only contains code for measuring every entry inside the test list and producing a small report. I am using custom uncommitted scripts for the data analysis.

@bassosimone

bassosimone commented Aug 3, 2021

Classification of errors

I ran tests on a GCE box located in the Amsterdam area. The number of entries in the test list that failed is surprisingly large compared to the total number of entries (I expected to see fewer errors).

| Result | Frequency | Percentage |
|---|---:|---:|
| Success | 32149 | 91.37 |
| generic_timeout_error | 1016 | 2.89 |
| dns_nxdomain_error | 444 | 1.26 |
| connection_refused | 259 | 0.74 |
| ssl_invalid_hostname | 246 | 0.70 |
| connection_reset | 246 | 0.70 |
| ssl_unknown_authority | 170 | 0.48 |
| ssl_invalid_certificate | 149 | 0.42 |
| dns_lookup_error | 148 | 0.42 |
| unsupported_protocol_scheme | 144 | 0.41 |
| eof_error | 90 | 0.26 |
| host_unreachable | 56 | 0.16 |
| http_redirect_error | 30 | 0.09 |
| ssl_handshake_error | 20 | 0.06 |
| http2_protocol_error | 8 | 0.02 |
| network_unreachable | 5 | 0.01 |
| http2_stream_error | 5 | 0.01 |
| Total | 35185 | 100.00 |

I classified errors (which were raw Go errors) using rules compatible with what OONI Probe does.
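For reference, here is a minimal Go sketch of what such a classification could look like. The matched substrings below are my assumptions about common Go error messages, not the exact rules implemented by the gardener tool or by OONI Probe.

```go
package main

import "strings"

// classifyError maps a raw Go error onto a failure string similar to the
// ones used in the table above. Simplified sketch: the substrings are
// assumptions, not the actual OONI Probe mapping rules.
func classifyError(err error) string {
	if err == nil {
		return "Success"
	}
	s := err.Error()
	switch {
	case strings.Contains(s, "context deadline exceeded"),
		strings.Contains(s, "i/o timeout"):
		return "generic_timeout_error"
	case strings.Contains(s, "no such host"):
		return "dns_nxdomain_error"
	case strings.Contains(s, "connection refused"):
		return "connection_refused"
	case strings.Contains(s, "connection reset by peer"):
		return "connection_reset"
	case strings.Contains(s, "certificate is valid for"):
		return "ssl_invalid_hostname"
	case strings.Contains(s, "certificate signed by unknown authority"):
		return "ssl_unknown_authority"
	case strings.Contains(s, "EOF"):
		return "eof_error"
	default:
		return "unknown_failure: " + s
	}
}
```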

@bassosimone

bassosimone commented Aug 3, 2021

Length of the redirect chains

In case of failure, the redirect chain contains zero elements. If there is no redirection, it contains a single element. Otherwise, it contains two or more elements. The number of redirections is N - 1, where N is the chain length.

| Redirect chain length | Frequency | Percentage |
|---:|---:|---:|
| 0 | 3036 | 8.63 |
| 1 | 19253 | 54.71 |
| 2 | 10149 | 28.85 |
| 3 | 2229 | 6.36 |
| 4 | 362 | 1.03 |
| 5 | 131 | 0.36 |
| 6 | 19 | 0.05 |
| 7 | 4 | 0.01 |
| 8 | 1 | 0.00 |
| 9 | 1 | 0.00 |
| Total | 35185 | 100.00 |

I was surprised to see that in most cases, the redirect chain is short. I expected to see longer chains.
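For clarity, this is roughly how the redirect chain can be reconstructed with Go's net/http client: a failure yields an empty chain and a request without redirects yields a single-element chain. A minimal sketch, assuming a plain GET with a 30-second timeout, not the actual gardener code.

```go
package main

import (
	"errors"
	"net/http"
	"time"
)

// fetchChain returns the sequence of URLs visited when fetching rawURL.
// An error yields a zero-length chain; no redirection yields a chain of
// length one; otherwise the chain has two or more elements.
func fetchChain(rawURL string) []string {
	client := &http.Client{
		Timeout: 30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 20 {
				return errors.New("too many redirects")
			}
			return nil // follow the redirect
		},
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return nil // failure: zero elements
	}
	defer resp.Body.Close()
	// Walk backwards through the requests that produced the final response.
	var chain []string
	for req := resp.Request; req != nil; {
		chain = append([]string{req.URL.String()}, chain...)
		if req.Response == nil {
			break // this was the original request
		}
		req = req.Response.Request
	}
	return chain
}
```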

@bassosimone

bassosimone commented Aug 3, 2021

Status code for redirects

We only consider redirect chains with at least two elements and classify the HTTP status code of the first redirect response:

| Status | Frequency | Percentage |
|---:|---:|---:|
| 301 | 10418 | 81% |
| 302 | 2383 | 18% |
| 307 | 47 | ~0% |
| 308 | 31 | ~0% |
| 303 | 17 | ~0% |
| other | 0 | 0.00% |
| total | 12896 | 100.00% |

Considering that 301 and 308 are permanent redirects while 302 and 307 are temporary redirects, if we want to trust the semantics attached to the status codes, then in most cases we should probably use the new URL directly.

Here is the same table when only considering chains of exactly two elements:

| Status | Frequency | Percentage |
|---:|---:|---:|
| 301 | 8226 | 81% |
| 302 | 1854 | 18% |
| 307 | 29 | ~0% |
| 308 | 23 | ~0% |
| 303 | 17 | ~0% |
| other | 0 | 0.00% |
| total | 10149 | 100.00% |

We can conclude that in 81% of the cases the real URL to measure is probably just one 301 away.

@bassosimone

bassosimone commented Aug 3, 2021

Changes in the first redirection

We start by checking what changes between the test list URL and the first redirection.

| Change | Frequency | Percentage |
|---|---:|---:|
| http:// => https:// | 8445 | 65% |
| a.com => b.org | 2235 | 16% |
| / => /foo | 2133 | 17% |
| example.com => www.example.com | 1459 | 11% |
| www.example.com => example.com | 768 | 6% |
| https:// => http:// | 140 | 1% |
| total | 12896 | N/A |

Note that any input URL may fall into one or more of these categories. For this reason, we cannot sum the percentages: the categories are not disjoint events.

@bassosimone

bassosimone commented Aug 3, 2021

Changes in subsequent redirections

We analyze the changes between the URL of the first redirection and the final URL. (That is, this table shows what additional changes we see after the first redirection.)

| Change | Frequency | Percentage |
|---|---:|---:|
| example.com => www.example.com | 621 | 23% |
| / => /foo | 594 | 22% |
| http:// => https:// | 544 | 20% |
| a.com => b.org | 440 | 16% |
| www.example.com => example.com | 336 | 12% |
| https:// => http:// | 14 | 0.5% |
| total | 2747 | N/A |

Note that any input URL may fall into one or more of these categories. For this reason, we cannot sum the percentages: the categories are not disjoint events.

@bassosimone

We should repeat the experiment from the network in which we run test helpers (Greenhost or DigitalOcean) to check whether we see fewer errors. We should also save the length in bytes of each webpage.

@bassosimone

bassosimone commented Aug 4, 2021

I repeated the experiment from a DigitalOcean droplet using bassosimone/test-lists-gardener@8cf3766. Here are the new results.

@bassosimone

bassosimone commented Aug 4, 2021

Classification of errors

Here's what we measured from DigitalOcean:

| Result | Frequency | Percentage |
|---|---:|---:|
| Success | 32472 | 92.67 |
| generic_timeout_error | 699 | 1.99 |
| dns_nxdomain_error | 428 | 1.22 |
| ssl_invalid_hostname | 286 | 0.82 |
| connection_reset | 260 | 0.74 |
| http2_protocol_error | 211 | 0.60 |
| ssl_invalid_certificate | 204 | 0.58 |
| ssl_unknown_authority | 174 | 0.50 |
| connection_refused | 129 | 0.37 |
| eof_error | 86 | 0.25 |
| host_unreachable | 56 | 0.16 |
| ssl_handshake_error | 31 | 0.09 |
| http2_stream_error | 5 | 0.01 |
| Total | 35041 | 100.00 |

TBD: compare with previous measurement and write comment.

Note: in the GCE run I also mistakenly tested the header rows of the CSV files, resulting in 144 errors of type unsupported_protocol_scheme; 35185 - 144 = 35041.

It's also interesting to see that http_redirect_error is gone, because this second version of the gardener tool for the test lists uses cookies. So we know that ~30 URLs cannot be tested without cookies.
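For context, enabling cookies in a Go HTTP client just means attaching a cookie jar; a minimal sketch (the 30-second timeout is an assumption):

```go
package main

import (
	"net/http"
	"net/http/cookiejar"
	"time"
)

// newClientWithCookies returns an HTTP client that stores cookies across
// redirects. Some sites redirect in a loop unless the cookies they set are
// sent back, which previously surfaced as http_redirect_error.
func newClientWithCookies() (*http.Client, error) {
	jar, err := cookiejar.New(nil) // default options are enough here
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar, Timeout: 30 * time.Second}, nil
}
```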

@bassosimone

bassosimone commented Aug 4, 2021

Length of the redirect chains

| Redirect chain length | Frequency | Percentage |
|---:|---:|---:|
| 0 | 2228 | 6 |
| 1 | 19767 | 56 |
| 2 | 10273 | 29 |
| 3 | 2214 | 6 |
| 4 | 378 | 1 |
| 5 | 128 | ~0 |
| 6 | 33 | ~0 |
| 7 | 5 | ~0 |
| 8 | 1 | ~0 |
| 9 | 2 | ~0 |
| 11 | 12 | ~0 |
| Total | 35041 | 100 |

At first glance, the results are roughly the same as from GCE.

@bassosimone

Changes in the first redirection

| Change | Frequency | Percentage |
|---|---:|---:|
| http:// => https:// | 8618 | 66% |
| a.com => b.org | 2193 | 17% |
| / => /foo | 2159 | 17% |
| example.com => www.example.com | 1464 | 11% |
| www.example.com => example.com | 784 | 6% |
| https:// => http:// | 138 | 1% |
| total | 13046 | N/A |

Here the results are in line with before.

@bassosimone

Status code of the first redirection

We take the first response in each chain and record its status code.

| Status | Frequency | Percentage |
|---:|---:|---:|
| 301 | 10573 | 81% |
| 302 | 2375 | 18% |
| 307 | 45 | ~0% |
| 308 | 34 | ~0% |
| 303 | 18 | ~0% |
| other | 0 | 0.00% |
| total | 13046 | 100.00% |

It's quite similar to before.

@bassosimone

bassosimone commented Aug 4, 2021

Size of webpages

This is a new metric: we take the last response in the redirection chain and build the distribution of the body size.

[plot: size-of-the-final-page]

The current implementation truncates bodies larger than 1<<17 bytes (128 KiB) to exactly 1<<17 bytes.
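A minimal sketch of such a truncation in Go, assuming we already have the http.Response at hand:

```go
package main

import (
	"io"
	"net/http"
)

// readTruncatedBody reads at most 1<<17 bytes (128 KiB) of the response
// body, mirroring the truncation described above.
func readTruncatedBody(resp *http.Response) ([]byte, error) {
	const maxBodySize = 1 << 17
	defer resp.Body.Close()
	return io.ReadAll(io.LimitReader(resp.Body, maxBodySize))
}
```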

The following plot, instead, shows the cumulative sizes before fetching the final pages. That is, this is the number of bytes we download when following all the redirections:

[plot: before-the-final-page]

In most cases, redirections have an empty body, or a very small body.

@bassosimone

bassosimone commented Aug 4, 2021

Classification of input URLs

Let's just classify the input URLs in the test list:

| Class | Frequency | Percentage |
|---|---:|---:|
| http_homepage | 16932 | 48% |
| https_homepage | 12329 | 35% |
| https_other | 2955 | 8% |
| http_other | 2825 | 8% |
| other | 0 | 0% |
| total | 35041 | 100% |

It's interesting to see how many http_homepage entries we have. A related question is what the intent of whoever created the test list entries was in the first place. My personal guess is the following:

  1. the test list is not a block list, so we don't know whether the entries are blocked, but we may reason about why those entries were added and therefore try to figure out how they could have been blocked at the time;

  2. if the entry is HTTP and a homepage and the entry was blocked, then probably there was some keyword rule on the Host header or on the content of the page itself;

  3. if the entry is HTTP and not a homepage, then probably what mattered was the URL or the web page content;

  4. if the entry is HTTPS and a homepage, then probably what mattered was DNS/TCP/TLS;

  5. finally, if the entry is HTTPS and not a homepage, it is probably a resource that someone wants to download, to ensure we can fetch related resources and to see the speed at which we fetch them.

We need to think about these topics. They may be quite useful to understand how to update the test lists.

@bassosimone

bassosimone commented Aug 4, 2021

Honour 30{1,8} and classify again

Let's start with the easiest change: do what a search engine would do and automatically update all URLs in the test list by following their 301 or 308 redirect. Here is the classification of the URLs we obtain after that change.

| Class | Frequency | Percentage | Change since before | Percentage change |
|---|---:|---:|---:|---:|
| https_homepage | 17883 | 51% | +5554 | +16% |
| http_homepage | 10395 | 30% | -6537 | -19% |
| https_other | 5070 | 14% | +2115 | +6% |
| http_other | 1693 | 5% | -1132 | -3% |
| other | 0 | 0% | 0 | 0% |
| total | 35041 | 100% | 0 | 0% |

After this change, we are still left with a significant fraction of HTTP websites. I think it's completely reasonable to update all URLs in the test list for which we have a 301 or 308, because that's what a search engine would do.
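A minimal sketch of this upgrade rule in Go, where we honour a single 301 or 308 hop (the 30-second timeout and the choice to keep the original URL when the Location header is unusable are my assumptions):

```go
package main

import (
	"net/http"
	"time"
)

// upgradeOnPermanentRedirect returns the redirect target if rawURL answers
// with a 301 or 308, and the original URL otherwise.
func upgradeOnPermanentRedirect(rawURL string) (string, error) {
	client := &http.Client{
		Timeout: 30 * time.Second,
		// Do not follow redirects automatically: we want to inspect them.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusMovedPermanently, http.StatusPermanentRedirect: // 301, 308
		loc, err := resp.Location()
		if err != nil {
			return rawURL, nil // no usable Location header: keep the entry
		}
		return loc.String(), nil
	default:
		return rawURL, nil
	}
}
```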

@bassosimone

bassosimone commented Aug 5, 2021

Investigating the remaining HTTP URLs

After the previous analysis, there are still 12088 websites in the test list whose URL is HTTP and that do not provide a 301 or 308 redirection. Let us now walk through them and assess which ones support HTTPS satisfactorily.

The criteria to determine whether they support HTTPS are the following. First, we must be able to establish a TLS connection with the server. Second, the webpage obtained using HTTPS must be between 0.7 and 1/0.7 times the size of the webpage obtained using HTTP. The latter is a very simplistic check; for more accuracy, we could have used ssdeep or fuzzywuzzy.
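Here is a rough Go sketch of these two criteria. fetchBodySize is a hypothetical helper returning the body length for a URL; the 30-second timeout and the use of port 443 are assumptions.

```go
package main

import (
	"crypto/tls"
	"net"
	"net/url"
	"time"
)

// supportsHTTPS applies the two criteria described above to an http:// URL:
// (1) we can complete a TLS handshake with the host and (2) the HTTPS body
// size is within [0.7, 1/0.7] of the HTTP body size.
func supportsHTTPS(httpURL string) (bool, error) {
	u, err := url.Parse(httpURL)
	if err != nil || u.Scheme != "http" {
		return false, err
	}
	// Criterion 1: TLS handshake on port 443.
	dialer := &net.Dialer{Timeout: 30 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp",
		net.JoinHostPort(u.Hostname(), "443"),
		&tls.Config{ServerName: u.Hostname()})
	if err != nil {
		return false, nil // no TLS support
	}
	conn.Close()
	// Criterion 2: simplistic body-size comparison.
	httpSize, err := fetchBodySize(httpURL) // hypothetical helper
	if err != nil || httpSize <= 0 {
		return false, err
	}
	httpsURL := *u
	httpsURL.Scheme = "https"
	httpsSize, err := fetchBodySize(httpsURL.String()) // hypothetical helper
	if err != nil {
		return false, nil
	}
	ratio := float64(httpsSize) / float64(httpSize)
	return ratio >= 0.7 && ratio <= 1/0.7, nil
}
```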

The result of this investigation is that 5475 out of 12088 websites support HTTPS even though they do not provide a redirect to us using 301 or 308.

The general rule for compiling the test lists that I am aware of is the following: if a website supports both, we should prefer HTTPS in the test lists. Therefore, if we were to follow this rule, we could have just 6613 HTTP websites in the test lists, which would roughly correspond to 19% of the URLs.

Another (perhaps orthogonal) possibility worth exploring is having the new web connectivity test helper check whether the website supports both HTTP and HTTPS. The algorithm could roughly be as follows (see the sketch after this list):

  1. if the input URL is HTTP, check for HTTPS and, if it works, return to the client a response indicating it should test both;

  2. if the input URL is HTTPS, check for HTTP and, if it works, return to the client a response indicating it should test both.
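A minimal sketch of this algorithm, where probeScheme is a hypothetical helper that checks whether the website answers over the given scheme:

```go
package main

import "net/url"

// urlsToTest returns the input URL plus, when the opposite scheme appears
// to work, the same URL with the scheme swapped, as the test helper would
// instruct the client to do.
func urlsToTest(input *url.URL) []string {
	out := []string{input.String()}
	other := *input
	switch input.Scheme {
	case "http":
		other.Scheme = "https"
	case "https":
		other.Scheme = "http"
	default:
		return out
	}
	if probeScheme(&other) { // hypothetical: does the other scheme work?
		out = append(out, other.String())
	}
	return out
}
```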

The bottom line of this reasoning seems to be the following. If we accept the notion that the test helper follows redirections and instructs the client about which URLs and endpoints to test (including whether to use QUIC), then our job becomes significantly simpler because:

  1. by changing the implementation of the test helper, we can instruct (new) web connectivity clients to test different URLs/endpoints depending on what we need (for example, we may stop testing for HTTP if it becomes useless);

  2. the need to constantly curate the test list to keep links up-to-date becomes less pressing (it is still advisable to have a curation procedure in place, and, most likely, we should really honour 301 redirects, but such a test helper could still compensate for issues/errors in the test list and help us test what we'd like anyway).

So, it seems the question now becomes whether we can write such a test helper and what issues a test helper working like this could cause for us.

@bassosimone

bassosimone commented Aug 5, 2021

More in-depth study of bodies using ssdeep

For the 5475 HTTP URLs that support HTTPS but do not redirect, let's plot the difference between the length of the HTTP page and the length of the HTTPS page. This tells us how often the two pages have the ~same length.

[plot: length-diff]

It turns out that in most cases the two pages have the ~same length. We can now try to use ssdeep on the web pages that really have a comparable length.

Let us now plot the ssdeep score (a number between 0 and 100, where 100 means a perfect match) versus the length of the HTTP page. The -10, -20, -30, and -40 scores indicate cases where it was not possible to compute the ssdeep score: -10 when len(httpWebpage) < 4k, -20 when len(httpsWebpage) < 4k, -30 when the HTTP webpage is less than 90% of the size of the HTTPS webpage, and -40 when the HTTPS webpage is less than 90% of the size of the HTTP webpage.
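A sketch of this scoring scheme, where ssdeepHash and ssdeepCompare are hypothetical wrappers around an ssdeep binding:

```go
package main

// compareWithSSDeep returns an ssdeep similarity score in [0, 100], or one
// of the negative sentinel values described above when the comparison
// cannot be performed.
func compareWithSSDeep(httpBody, httpsBody []byte) int {
	const minSize = 4096 // ssdeep refuses inputs smaller than 4 KiB
	switch {
	case len(httpBody) < minSize:
		return -10
	case len(httpsBody) < minSize:
		return -20
	case float64(len(httpBody)) < 0.9*float64(len(httpsBody)):
		return -30
	case float64(len(httpsBody)) < 0.9*float64(len(httpBody)):
		return -40
	}
	return ssdeepCompare(ssdeepHash(httpBody), ssdeepHash(httpsBody))
}
```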

[plot: ssdeepcheck]

We cannot feed ssdeep pages smaller than 4k (i.e., 1<<12 bytes) because it refuses to process them. There does not seem to be any real correlation between the size of the page and the score. Equipped with this knowledge, we can now check the distribution of the ssdeep score for these pages and, more importantly, what that means for us.

[plot: ssdeep-score]

As before, we have special negative scores (20% of the pages). For the other pages, we see that in many cases the score is quite far from 100. The tricky part will be understanding what that means.

So, I took a random page with score 40 (i.e., "low") and inspected the HTTP and HTTPS versions. What is quite disappointing is that, as far as the real content is concerned, the pages are basically the same (at least, they appear to me to be the same, or really close in content). The differences (as evidenced by a diff) are mostly in the metadata, e.g., links are HTTPS links instead of HTTP links or just point elsewhere. We clearly need to improve upon this initial solution, which only relied on blindly applying ssdeep to the whole content of the page.

@bassosimone

bassosimone commented Aug 6, 2021

Same as above but using TLSH

Let us use the Trend Micro Locality Sensitive Hash fuzzy hashing function (aka TLSH). A quick skim of the associated paper indicates that this hashing scheme has some advantages over ssdeep, particularly because its similarity output is not limited to the integer range between 0 and 100, and (IIUC) because it degrades more smoothly.

The output range for TLSH comparison goes from 0 (equality) up to (IIUC) arbitrarily large values. I have arbitrarily chosen to represent cases where we could not perform the comparison using the value 1e06. The idea is to create weight on the right end of the empirical CDF, and (if my estimate is correct) 1e06 is bigger than the biggest value emitted by TLSH for this input set. The following is the plot of the empirical CDF for the 5475 HTTP URLs that support HTTPS but do not redirect:

[plot: tlsh-v9]

The paper introducing TLSH includes in Section IV a table (Table II) indicating (IIUC) the precision and recall of TLSH compared with ssdeep and other fuzzy hashing techniques. The sample they used to generate the table was small because they needed to compare files manually. From that table, it seems that we have a false positive rate of ~7% and a detection rate of ~94% if we keep the TLSH score strictly lower than 100. I'm not sure whether this result can be generalized so easily, but it is probably useful when starting to think about what the scores above mean.

Let us also add a pre-processing step. Under the assumption that the input is HTML (which can be verified by checking the Content-Type), we strip all HTML tags using bluemonday's StrictPolicy HTML sanitization policy. In most cases, bluemonday gives us a web page consisting of text and blanks. Let us see what we obtain, with this preprocessing step in place, in terms of equality of webpages:

[plot: tlsh-v6]

Because 0 should mean equality, here we see that we really have many equal pages. We should probably repeat the analysis with this filtering but using ssdeep instead of tlsh, to see what happens with that fuzzy hashing scheme. (Though the problem would be that ssdeep refuses to compare pages smaller than 4 KiB.)
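For reference, the preprocessing step amounts to something like the following, where tlshDiff is a hypothetical wrapper around a TLSH binding (returning 0 for identical inputs), while the bluemonday call is the actual API used for stripping tags:

```go
package main

import "github.com/microcosm-cc/bluemonday"

// stripAndCompare removes all HTML tags with bluemonday's StrictPolicy and
// then compares the remaining text with a fuzzy hash. tlshDiff is a
// hypothetical wrapper around a TLSH binding.
func stripAndCompare(httpBody, httpsBody string) int {
	p := bluemonday.StrictPolicy() // strips every HTML tag, keeps the text
	return tlshDiff(p.Sanitize(httpBody), p.Sanitize(httpsBody))
}
```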

Another aspect that may be interesting to look into is whether files whose tlsh score is quite distant from zero are actually different. That is, what makes the score so high?

Checking high difference scores with bluemonday preprocessing

So, I checked one of the pages with a difference score of 1e06, and here's what I learned. Neither the HTTP nor the HTTPS webpage contains any real content, only JavaScript and other meta information; still, the two pages are very different.

Then I looked at pages with a difference score of 828 (website: http://www.pocoes.ba.gov.br/). It turns out the HTTP version of the website just loads another website inside an iframe. Upon inspection, this website looks very different from the one you get by loading the corresponding HTTPS URL.

Here's another random sampling check: http://igihe.bi/ (difference score: 761). The HTTP response was a short webpage saying that the request contained headers that made it not acceptable. The HTTPS response is the real web page.

Here's instead a case with difference score 53: https://blogs.wsj.com/indiarealtime/tag/arindam-chaudhuri. This is the diff between the HTTP and HTTPS version of the webpage:

```diff
diff -u cache/aHR0cDovL2Jsb2dzLndzai5jb20vaW5kaWFyZWFsdGltZS90YWcvYXJpbmRhbS1jaGF1ZGh1cmk= cache/aHR0cHM6Ly9ibG9ncy53c2ouY29tL2luZGlhcmVhbHRpbWUvdGFnL2FyaW5kYW0tY2hhdWRodXJp
--- cache/aHR0cDovL2Jsb2dzLndzai5jb20vaW5kaWFyZWFsdGltZS90YWcvYXJpbmRhbS1jaGF1ZGh1cmk=	2021-08-04 19:42:57.000000000 +0200
+++ cache/aHR0cHM6Ly9ibG9ncy53c2ouY29tL2luZGlhcmVhbHRpbWUvdGFnL2FyaW5kYW0tY2hhdWRodXJp	2021-08-04 19:42:57.000000000 +0200
@@ -13,7 +13,7 @@
 <HR noshade size="1px">
 <PRE>
 Generated by cloudfront (CloudFront)
-Request ID: WAWr0DG4pZh2v5UauauqZq1bht8IIAhTaRozhGjx5_E4Ggwb4ncVRw==
+Request ID: w7GEZUzh3QQpHzJbb7WSARYQrKUKbAwzc1kmLM_yopTp4nAaN1Zofw==
 </PRE>
 <ADDRESS>
 </ADDRESS>
```

@bassosimone

Conclusions

We should honour the 301 and 308 redirects: they are a legitimate reason for updating the test list. If we implement this very reasonable change, we're left with ~12k HTTP URLs in the test list.

We can divide those ~12k URLs in two sets: those for which we can establish a successful HTTPS connection with the same website and those for which we cannot. The former set consists of around 5.5k URLs.

If we use the tlsh difference between HTTP and HTTPS as a metric and we define equality in a very straightforward way, i.e., when such difference is zero, then we can safely update to HTTPS 50% of the remaining 5.5k URLs.

If we additionally pre-filter the content of webpages using bluemonday to exclude all HTML tags and only keep the textual content, then the number of equal webpages becomes 80%.

So, we can convert between 2.5k and 4.4k URLs to HTTPS, and we end up with between 7.6k and 9.0k URLs that are HTTP only. Some of these [7.6k, 9.0k] URLs may yield errors. The 12k - 5.5k URLs that we could not convert also include URLs where we could not even access the HTTP website because of errors such as NXDOMAIN.

When redesigning Web Connectivity, one problem we want to solve is that the probe may not be able to discover all the URLs to test because of censorship. For example, say for some reason the test list is not up-to-date and http://example.com is a 301 redirect to https://example.com. However, say http://example.com is censored with a connection reset when the censorship equipment spots the example.com host header. In such a case, the probe cannot discover that it also needs to test https://example.com without help from the test helper.

A more clever test helper could really help us compensate for an inaccurate test list. For example, given the https://example.org input, it may be tempting to also check http://example.org. This seems an extra argument in favour of migrating (and, confusingly, also for not migrating!) as many URLs as possible to HTTPS.

The only big conceptual issue that I continue to see, if we go down this road of improving Web Connectivity, is that a single measurement will increasingly be a collection of all the tests performed starting from the input URL and all the possibly relevant resources associated with it. That is, a user asking in which cases https://www.example.com is SNI blocked (or TLS blocked, or just blocked) may fail to find the desired answer when searching for the input field equal to https://example.com, because they may miss crucial http://example.com measurements. (This is probably the reason why the search input box in Explorer suggests entering a domain rather than a full URL.)

@bassosimone

I completed this activity in the previous Sprint, so I am going to put it back into the previous Sprint and close it. I didn't close it until now because I wanted to first discuss the results contained in this issue with @hellais and @agrabeli.

The follow-up issue is this epic: #1745.
