test lists: investigate errors and redirections #1727

Closed · bassosimone opened this issue Aug 3, 2021 · 20 comments
@bassosimone

This issue covers an initial reconnaissance aimed at understanding which URLs in the test list trigger errors from uncensored locations, as well as the average number of redirects in the test list.

This activity supports the webconnectivity redesign (#1714).

bassosimone self-assigned this on Aug 3, 2021
@bassosimone

I created the bassosimone/gardener repository to collect scripts for checking every entry inside the test list. As of bassosimone/test-lists-gardener@c9892fb, the repository only contains code for measuring every entry inside the test list and producing a small report. I am using custom uncommitted scripts for the data analysis.

@bassosimone

bassosimone commented Aug 3, 2021

Classification of errors

I ran tests on a GCE box located in the Amsterdam area. The number of entries in the test list that failed is surprisingly large compared to the total number of entries (I expected to see fewer errors).

| Result | Frequency | Percentage |
|---|---:|---:|
| Success | 32149 | 91.37 |
| generic_timeout_error | 1016 | 2.89 |
| dns_nxdomain_error | 444 | 1.26 |
| connection_refused | 259 | 0.74 |
| ssl_invalid_hostname | 246 | 0.70 |
| connection_reset | 246 | 0.70 |
| ssl_unknown_authority | 170 | 0.48 |
| ssl_invalid_certificate | 149 | 0.42 |
| dns_lookup_error | 148 | 0.42 |
| unsupported_protocol_scheme | 144 | 0.41 |
| eof_error | 90 | 0.26 |
| host_unreachable | 56 | 0.16 |
| http_redirect_error | 30 | 0.09 |
| ssl_handshake_error | 20 | 0.06 |
| http2_protocol_error | 8 | 0.02 |
| network_unreachable | 5 | 0.01 |
| http2_stream_error | 5 | 0.01 |
| Total | 35185 | 100.00 |

I classified errors (which were raw Go errors) using rules compatible with what OONI Probe does.
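For reference, here is a minimal Go sketch of what such a classification could look like. The matched substrings below are my assumptions about common Go error messages, not the exact rules implemented by the gardener tool or by OONI Probe.

```go
package main

import "strings"

// classifyError maps a raw Go error onto a failure string similar to the
// ones used in the table above. Simplified sketch: the substrings are
// assumptions, not the actual OONI Probe mapping rules.
func classifyError(err error) string {
	if err == nil {
		return "Success"
	}
	s := err.Error()
	switch {
	case strings.Contains(s, "context deadline exceeded"),
		strings.Contains(s, "i/o timeout"):
		return "generic_timeout_error"
	case strings.Contains(s, "no such host"):
		return "dns_nxdomain_error"
	case strings.Contains(s, "connection refused"):
		return "connection_refused"
	case strings.Contains(s, "connection reset by peer"):
		return "connection_reset"
	case strings.Contains(s, "certificate is valid for"):
		return "ssl_invalid_hostname"
	case strings.Contains(s, "certificate signed by unknown authority"):
		return "ssl_unknown_authority"
	case strings.Contains(s, "EOF"):
		return "eof_error"
	default:
		return "unknown_failure: " + s
	}
}
```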

@bassosimone

bassosimone commented Aug 3, 2021

Length of the redirect chains

In case of failure, the redirect chain contains zero elements. If there is no redirection, it contains a single element. Otherwise, it contains two or more elements. The number of redirections is N - 1, where N is the chain length.

| Redirect chain length | Frequency | Percentage |
|---:|---:|---:|
| 0 | 3036 | 8.63 |
| 1 | 19253 | 54.71 |
| 2 | 10149 | 28.85 |
| 3 | 2229 | 6.36 |
| 4 | 362 | 1.03 |
| 5 | 131 | 0.36 |
| 6 | 19 | 0.05 |
| 7 | 4 | 0.01 |
| 8 | 1 | 0.00 |
| 9 | 1 | 0.00 |
| Total | 35185 | 100.00 |

I was surprised to see that in most cases, the redirect chain is short. I expected to see longer chains.
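For clarity, this is roughly how the redirect chain can be reconstructed with Go's net/http client: a failure yields an empty chain and a request without redirects yields a single-element chain. A minimal sketch, assuming a plain GET with a 30-second timeout, not the actual gardener code.

```go
package main

import (
	"errors"
	"net/http"
	"time"
)

// fetchChain returns the sequence of URLs visited when fetching rawURL.
// An error yields a zero-length chain; no redirection yields a chain of
// length one; otherwise the chain has two or more elements.
func fetchChain(rawURL string) []string {
	client := &http.Client{
		Timeout: 30 * time.Second,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 20 {
				return errors.New("too many redirects")
			}
			return nil // follow the redirect
		},
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return nil // failure: zero elements
	}
	defer resp.Body.Close()
	// Walk backwards through the requests that produced the final response.
	var chain []string
	for req := resp.Request; req != nil; {
		chain = append([]string{req.URL.String()}, chain...)
		if req.Response == nil {
			break // this was the original request
		}
		req = req.Response.Request
	}
	return chain
}
```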

@bassosimone

bassosimone commented Aug 3, 2021

Status code for redirects

We only consider redirect chains with at least two elements and classify the HTTP status code of the first redirect response:

| Status | Frequency | Percentage |
|---:|---:|---:|
| 301 | 10418 | 81% |
| 302 | 2383 | 18% |
| 307 | 47 | ~0% |
| 308 | 31 | ~0% |
| 303 | 17 | ~0% |
| other | 0 | 0.00% |
| total | 12896 | 100.00% |

Considering that 301 and 308 are permanent redirects while 302 and 307 are temporary redirects, if we want to trust the semantics attached to the status codes, then in most cases we should probably use the new URL directly.

Here is the same table when only considering chains of exactly two elements:

| Status | Frequency | Percentage |
|---:|---:|---:|
| 301 | 8226 | 81% |
| 302 | 1854 | 18% |
| 307 | 29 | ~0% |
| 308 | 23 | ~0% |
| 303 | 17 | ~0% |
| other | 0 | 0.00% |
| total | 10149 | 100.00% |

We can conclude that in 81% of the cases the real URL to measure is probably just one 301 away.

@bassosimone

bassosimone commented Aug 3, 2021

Changes in the first redirection

We start by checking what changes between the test list URL and the first redirection.

| Change | Frequency | Percentage |
|---|---:|---:|
| http:// => https:// | 8445 | 65% |
| a.com => b.org | 2235 | 16% |
| / => /foo | 2133 | 17% |
| example.com => www.example.com | 1459 | 11% |
| www.example.com => example.com | 768 | 6% |
| https:// => http:// | 140 | 1% |
| total | 12896 | N/A |

Note that any input URL may fall into one or more of these categories. For this reason, we cannot sum the percentages: the categories are not disjoint events.

@bassosimone

bassosimone commented Aug 3, 2021

Changes in subsequent redirections

We analyze the changes between the URL of the first redirection and the final URL. (That is, this table shows what additional changes we see after the first redirection.)

| Change | Frequency | Percentage |
|---|---:|---:|
| example.com => www.example.com | 621 | 23% |
| / => /foo | 594 | 22% |
| http:// => https:// | 544 | 20% |
| a.com => b.org | 440 | 16% |
| www.example.com => example.com | 336 | 12% |
| https:// => http:// | 14 | 0.5% |
| total | 2747 | N/A |

Note that any input URL may fall into one or more of these categories. For this reason, we cannot sum the percentages: the categories are not disjoint events.

@bassosimone

We should repeat the experiment from the network in which we run test helpers (Greenhost or DigitalOcean) to check whether we see fewer errors. We should also save the length in bytes of each webpage.

@bassosimone

bassosimone commented Aug 4, 2021

I repeated the experiment from a DigitalOcean droplet using bassosimone/test-lists-gardener@8cf3766. Here are the new results.

@bassosimone

bassosimone commented Aug 4, 2021

Classification of errors

Here's what we measured from DigitalOcean:

| Result | Frequency | Percentage |
|---|---:|---:|
| Success | 32472 | 92.67 |
| generic_timeout_error | 699 | 1.99 |
| dns_nxdomain_error | 428 | 1.22 |
| ssl_invalid_hostname | 286 | 0.82 |
| connection_reset | 260 | 0.74 |
| http2_protocol_error | 211 | 0.60 |
| ssl_invalid_certificate | 204 | 0.58 |
| ssl_unknown_authority | 174 | 0.50 |
| connection_refused | 129 | 0.37 |
| eof_error | 86 | 0.25 |
| host_unreachable | 56 | 0.16 |
| ssl_handshake_error | 31 | 0.09 |
| http2_stream_error | 5 | 0.01 |
| Total | 35041 | 100.00 |

TBD: compare with previous measurement and write comment.

Note: in the GCE run I also mistakenly tested the header rows of the CSV files, resulting in 144 errors of type unsupported_protocol_scheme; 35185 - 144 = 35041.

It's also interesting to see that http_redirect_error is gone, because this second version of the gardener tool for the test lists uses cookies. So we know that ~30 URLs cannot be tested without cookies.
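For context, enabling cookies in a Go HTTP client just means attaching a cookie jar; a minimal sketch (the 30-second timeout is an assumption):

```go
package main

import (
	"net/http"
	"net/http/cookiejar"
	"time"
)

// newClientWithCookies returns an HTTP client that stores cookies across
// redirects. Some sites redirect in a loop unless the cookies they set are
// sent back, which previously surfaced as http_redirect_error.
func newClientWithCookies() (*http.Client, error) {
	jar, err := cookiejar.New(nil) // default options are enough here
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar, Timeout: 30 * time.Second}, nil
}
```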

@bassosimone

bassosimone commented Aug 4, 2021

Length of the redirect chains

| Redirect chain length | Frequency | Percentage |
|---:|---:|---:|
| 0 | 2228 | 6 |
| 1 | 19767 | 56 |
| 2 | 10273 | 29 |
| 3 | 2214 | 6 |
| 4 | 378 | 1 |
| 5 | 128 | ~0 |
| 6 | 33 | ~0 |
| 7 | 5 | ~0 |
| 8 | 1 | ~0 |
| 9 | 2 | ~0 |
| 11 | 12 | ~0 |
| Total | 35041 | 100 |

At first glance, the results are roughly the same as from GCE.

@bassosimone

Changes in the first redirection

| Change | Frequency | Percentage |
|---|---:|---:|
| http:// => https:// | 8618 | 66% |
| a.com => b.org | 2193 | 17% |
| / => /foo | 2159 | 17% |
| example.com => www.example.com | 1464 | 11% |
| www.example.com => example.com | 784 | 6% |
| https:// => http:// | 138 | 1% |
| total | 13046 | N/A |

Here the results are in line with before.

@bassosimone

Status code of the first redirection

We take the first response in each chain and record its status code.

| Status | Frequency | Percentage |
|---:|---:|---:|
| 301 | 10573 | 81% |
| 302 | 2375 | 18% |
| 307 | 45 | ~0% |
| 308 | 34 | ~0% |
| 303 | 18 | ~0% |
| other | 0 | 0.00% |
| total | 13046 | 100.00% |

It's quite similar to before.

@bassosimone

bassosimone commented Aug 4, 2021

Size of webpages

This is a new metric: we take the last response in the redirection chain and build the distribution of the body size.

[plot: size-of-the-final-page]

The current implementation truncates bodies larger than 1<<17 bytes (128 KiB) to exactly 1<<17 bytes.
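A minimal sketch of such a truncation in Go, assuming we already have the http.Response at hand:

```go
package main

import (
	"io"
	"net/http"
)

// readTruncatedBody reads at most 1<<17 bytes (128 KiB) of the response
// body, mirroring the truncation described above.
func readTruncatedBody(resp *http.Response) ([]byte, error) {
	const maxBodySize = 1 << 17
	defer resp.Body.Close()
	return io.ReadAll(io.LimitReader(resp.Body, maxBodySize))
}
```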

The following plot, instead, shows the cumulative sizes before fetching the final pages. That is, this is the number of bytes we download when following all the redirections:

[plot: before-the-final-page]

In most cases, redirections have an empty body, or a very small body.

@bassosimone

bassosimone commented Aug 4, 2021

Classification of input URLs

Let's just classify the input URLs in the test list:

| Class | Frequency | Percentage |
|---|---:|---:|
| http_homepage | 16932 | 48% |
| https_homepage | 12329 | 35% |
| https_other | 2955 | 8% |
| http_other | 2825 | 8% |
| other | 0 | 0% |
| total | 35041 | 100% |

It's interesting to see how many http_homepage entries we have. A related question is what the intent of whoever created the test list entries was in the first place. My personal guess is the following:

  1. the test list is not a block list, so we don't know whether the entries are blocked, but we may reason about why those entries were added and therefore try to figure out how they could have been blocked at the time;

  2. if the entry is HTTP and a homepage and the entry was blocked, then probably there was some keyword rule on the Host header or on the content of the page itself;

  3. if the entry is HTTP and not a homepage, then probably what mattered was the URL or the web page content;

  4. if the entry is HTTPS and a homepage, then probably what mattered was DNS/TCP/TLS;

  5. finally, if the entry is HTTPS and not a homepage, it is probably a resource that someone wants to download, to ensure we can fetch related resources and to see the speed at which we fetch them.

We need to think about these topics. They may be quite useful to understand how to update the test lists.

@bassosimone

bassosimone commented Aug 4, 2021

Honour 30{1,8} and classify again

Let's start with the easiest change: do what a search engine would do and automatically update all URLs in the test list by following their 301 or 308 redirect. Here is the classification of the URLs we obtain after that change.

| Class | Frequency | Percentage | Change since before | Percentage change |
|---|---:|---:|---:|---:|
| https_homepage | 17883 | 51% | +5554 | +16% |
| http_homepage | 10395 | 30% | -6537 | -19% |
| https_other | 5070 | 14% | +2115 | +6% |
| http_other | 1693 | 5% | -1132 | -3% |
| other | 0 | 0% | 0 | 0% |
| total | 35041 | 100% | 0 | 0% |

After this change, we are still left with a significant fraction of HTTP websites. I think it's completely reasonable to update all URLs in the test list for which we have a 301 or 308, because that's what a search engine would do.
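A minimal sketch of this upgrade rule in Go, where we honour a single 301 or 308 hop (the 30-second timeout and the choice to keep the original URL when the Location header is unusable are my assumptions):

```go
package main

import (
	"net/http"
	"time"
)

// upgradeOnPermanentRedirect returns the redirect target if rawURL answers
// with a 301 or 308, and the original URL otherwise.
func upgradeOnPermanentRedirect(rawURL string) (string, error) {
	client := &http.Client{
		Timeout: 30 * time.Second,
		// Do not follow redirects automatically: we want to inspect them.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusMovedPermanently, http.StatusPermanentRedirect: // 301, 308
		loc, err := resp.Location()
		if err != nil {
			return rawURL, nil // no usable Location header: keep the entry
		}
		return loc.String(), nil
	default:
		return rawURL, nil
	}
}
```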

@bassosimone

bassosimone commented Aug 5, 2021

Investigating the remaining HTTP URLs

After the previous analysis, there are still 12088 websites in the test list whose URL is HTTP and that do not provide a 301 or 308 redirection. Let us now walk through them and assess which ones support HTTPS satisfactorily.

The criteria to determine whether they support HTTPS are the following. First, we must be able to establish a TLS connection with the server. Second, the webpage obtained using HTTPS must be between 0.7 and 1/0.7 times the size of the webpage obtained using HTTP. The latter is a very simplistic check; for more accuracy, we could have used ssdeep or fuzzywuzzy.
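Here is a rough Go sketch of these two criteria. fetchBodySize is a hypothetical helper returning the body length for a URL; the 30-second timeout and the use of port 443 are assumptions.

```go
package main

import (
	"crypto/tls"
	"net"
	"net/url"
	"time"
)

// supportsHTTPS applies the two criteria described above to an http:// URL:
// (1) we can complete a TLS handshake with the host and (2) the HTTPS body
// size is within [0.7, 1/0.7] of the HTTP body size.
func supportsHTTPS(httpURL string) (bool, error) {
	u, err := url.Parse(httpURL)
	if err != nil || u.Scheme != "http" {
		return false, err
	}
	// Criterion 1: TLS handshake on port 443.
	dialer := &net.Dialer{Timeout: 30 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp",
		net.JoinHostPort(u.Hostname(), "443"),
		&tls.Config{ServerName: u.Hostname()})
	if err != nil {
		return false, nil // no TLS support
	}
	conn.Close()
	// Criterion 2: simplistic body-size comparison.
	httpSize, err := fetchBodySize(httpURL) // hypothetical helper
	if err != nil || httpSize <= 0 {
		return false, err
	}
	httpsURL := *u
	httpsURL.Scheme = "https"
	httpsSize, err := fetchBodySize(httpsURL.String()) // hypothetical helper
	if err != nil {
		return false, nil
	}
	ratio := float64(httpsSize) / float64(httpSize)
	return ratio >= 0.7 && ratio <= 1/0.7, nil
}
```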

The result of this investigation is that 5475 out of 12088 websites support HTTPS even though they do not provide a redirect to us using 301 or 308.

The general rule for compiling the test lists that I am aware of is the following: if a website supports both, we should prefer HTTPS in the test lists. Therefore, if we were to follow this rule, we could have just 6613 HTTP websites in the test lists, which would roughly correspond to 19% of the URLs.

Another (perhaps orthogonal) possibility worth exploring is having the new web connectivity test helper check whether the website supports both HTTP and HTTPS. The algorithm could roughly be as follows (see the sketch after this list):

  1. if the input URL is HTTP, check for HTTPS and, if it works, return to the client a response indicating it should test both;

  2. if the input URL is HTTPS, check for HTTP and, if it works, return to the client a response indicating it should test both.
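A minimal sketch of this algorithm, where probeScheme is a hypothetical helper that checks whether the website answers over the given scheme:

```go
package main

import "net/url"

// urlsToTest returns the input URL plus, when the opposite scheme appears
// to work, the same URL with the scheme swapped, as the test helper would
// instruct the client to do.
func urlsToTest(input *url.URL) []string {
	out := []string{input.String()}
	other := *input
	switch input.Scheme {
	case "http":
		other.Scheme = "https"
	case "https":
		other.Scheme = "http"
	default:
		return out
	}
	if probeScheme(&other) { // hypothetical: does the other scheme work?
		out = append(out, other.String())
	}
	return out
}
```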

The bottom line of this reasoning seems to be the following. If we accept the notion that the test helper follows redirections and instructs the client about which URLs and endpoints to test (including whether to use QUIC), then our job becomes significantly simpler because:

  1. by changing the implementation of the test helper, we can instruct (new) web connectivity clients to test different URLs/endpoints depending on what we need (for example, we may stop testing for HTTP if it becomes useless);

  2. the need to constantly curate the test list to keep links up-to-date becomes less pressing (it is still advisable to have a curation procedure in place, and, most likely, we should really honour 301 redirects, but such a test helper could still compensate for issues/errors in the test list and help us test what we'd like anyway).

So, it seems the question now becomes whether we can write such a test helper and what issues a test helper working like this could cause for us.

@bassosimone

bassosimone commented Aug 5, 2021

More in-depth study of bodies using ssdeep

For the 5475 HTTP URLs that support HTTPS but do not redirect, let's plot the difference between the length of the HTTP page and the length of the HTTPS page. This tells us how often the two pages have the ~same length.

[plot: length-diff]

It turns out that in most cases the two pages have the ~same length. We can now try to use ssdeep on the web pages that really have a comparable length.

Let us now plot the ssdeep score (a number between 0 and 100, where 100 means a perfect match) versus the length of the HTTP page. The -10, -20, -30, and -40 scores indicate cases where it was not possible to compute the ssdeep score: -10 when len(httpWebpage) < 4k, -20 when len(httpsWebpage) < 4k, -30 when the HTTP webpage is less than 90% of the size of the HTTPS webpage, and -40 when the HTTPS webpage is less than 90% of the size of the HTTP webpage.
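A sketch of this scoring scheme, where ssdeepHash and ssdeepCompare are hypothetical wrappers around an ssdeep binding:

```go
package main

// compareWithSSDeep returns an ssdeep similarity score in [0, 100], or one
// of the negative sentinel values described above when the comparison
// cannot be performed.
func compareWithSSDeep(httpBody, httpsBody []byte) int {
	const minSize = 4096 // ssdeep refuses inputs smaller than 4 KiB
	switch {
	case len(httpBody) < minSize:
		return -10
	case len(httpsBody) < minSize:
		return -20
	case float64(len(httpBody)) < 0.9*float64(len(httpsBody)):
		return -30
	case float64(len(httpsBody)) < 0.9*float64(len(httpBody)):
		return -40
	}
	return ssdeepCompare(ssdeepHash(httpBody), ssdeepHash(httpsBody))
}
```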

[plot: ssdeepcheck]

We cannot feed ssdeep pages smaller than 4k (i.e., 1<<12 bytes) because it refuses to process them. There does not seem to be any real correlation between the size of the page and the score. Equipped with this knowledge, we can now check the distribution of the ssdeep score for these pages and, more importantly, what that means for us.

[plot: ssdeep-score]

As before, we have special negative scores (20% of the pages). For the other pages, we see that in many cases the score is quite far from 100. The tricky part will be understanding what that means.

So, I took a random page with score 40 (i.e., "low") and inspected the HTTP and HTTPS versions. What is quite disappointing is that, as far as the real content is concerned, the pages are basically the same (at least, they appear to me to be the same, or really close in content). The differences (as evidenced by a diff) are mostly in the metadata, e.g., links are HTTPS links instead of HTTP links or just point elsewhere. We clearly need to improve upon this initial solution, which only relied on blindly applying ssdeep to the whole content of the page.

@bassosimone

bassosimone commented Aug 6, 2021

Same as above but using TLSH

Let us use the Trend Micro Locality Sensitive Hash fuzzy hashing function (aka TLSH). A quick skim of the associated paper indicates that this hashing scheme has some advantages over ssdeep, particularly because its similarity output is not limited to the integer range between 0 and 100, and (IIUC) because it degrades more smoothly.

The output range for TLSH comparison goes from 0 (equality) up to (IIUC) arbitrarily large values. I have arbitrarily chosen to represent cases where we could not perform the comparison using the value 1e06. The idea is to create weight on the right end of the empirical CDF, and (if my estimate is correct) 1e06 is bigger than the biggest value emitted by TLSH for this input set. The following is the plot of the empirical CDF for the 5475 HTTP URLs that support HTTPS but do not redirect:

[plot: tlsh-v9]

The paper introducing TLSH includes in Section IV a table (Table II) indicating (IIUC) the precision and recall of TLSH compared with ssdeep and other fuzzy hashing techniques. The sample they used to generate the table was small because they needed to compare files manually. From that table, it seems that we have a false positive rate of ~7% and a detection rate of ~94% if we keep the TLSH score strictly lower than 100. I'm not sure whether this result can be generalized so easily, but it is probably useful when starting to think about what the scores above mean.

Let us also add a pre-processing step. Under the assumption that the input is HTML (which can be verified by checking the Content-Type), we strip all HTML tags using bluemonday's StrictPolicy HTML sanitization policy. In most cases, bluemonday gives us a web page consisting of text and blanks. Let us see what we obtain, with this preprocessing step in place, in terms of equality of webpages:

[plot: tlsh-v6]

Because 0 should mean equality, here we see that we really have many equal pages. We should probably repeat the analysis with this filtering but using ssdeep instead of tlsh, to see what happens with that fuzzy hashing scheme. (Though the problem would be that ssdeep refuses to compare pages smaller than 4 KiB.)
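For reference, the preprocessing step amounts to something like the following, where tlshDiff is a hypothetical wrapper around a TLSH binding (returning 0 for identical inputs), while the bluemonday call is the actual API used for stripping tags:

```go
package main

import "github.com/microcosm-cc/bluemonday"

// stripAndCompare removes all HTML tags with bluemonday's StrictPolicy and
// then compares the remaining text with a fuzzy hash. tlshDiff is a
// hypothetical wrapper around a TLSH binding.
func stripAndCompare(httpBody, httpsBody string) int {
	p := bluemonday.StrictPolicy() // strips every HTML tag, keeps the text
	return tlshDiff(p.Sanitize(httpBody), p.Sanitize(httpsBody))
}
```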

Another aspect that may be interesting to look into is whether files whose tlsh score is quite distant from zero are actually different. That is, what makes the score so high?

Checking high difference scores with bluemonday preprocessing

So, I checked one of the pages with a difference score of 1e06, and here's what I learned. Neither the HTTP nor the HTTPS webpage contains any real content, only JavaScript and other meta information; still, the two pages are very different.

Then I looked at pages with a difference score of 828 (website: http://www.pocoes.ba.gov.br/). It turns out the HTTP version of the website just loads another website inside an iframe. Upon inspection, this website looks very different from the one you get by loading the corresponding HTTPS URL.

Here's another random sampling check: http://igihe.bi/ (difference score: 761). The HTTP response was a short webpage saying that the request contained headers that made it not acceptable. The HTTPS response is the real web page.

Here's instead a case with difference score 53: https://blogs.wsj.com/indiarealtime/tag/arindam-chaudhuri. This is the diff between the HTTP and HTTPS version of the webpage:

```diff
diff -u cache/aHR0cDovL2Jsb2dzLndzai5jb20vaW5kaWFyZWFsdGltZS90YWcvYXJpbmRhbS1jaGF1ZGh1cmk= cache/aHR0cHM6Ly9ibG9ncy53c2ouY29tL2luZGlhcmVhbHRpbWUvdGFnL2FyaW5kYW0tY2hhdWRodXJp
--- cache/aHR0cDovL2Jsb2dzLndzai5jb20vaW5kaWFyZWFsdGltZS90YWcvYXJpbmRhbS1jaGF1ZGh1cmk=	2021-08-04 19:42:57.000000000 +0200
+++ cache/aHR0cHM6Ly9ibG9ncy53c2ouY29tL2luZGlhcmVhbHRpbWUvdGFnL2FyaW5kYW0tY2hhdWRodXJp	2021-08-04 19:42:57.000000000 +0200
@@ -13,7 +13,7 @@
 <HR noshade size="1px">
 <PRE>
 Generated by cloudfront (CloudFront)
-Request ID: WAWr0DG4pZh2v5UauauqZq1bht8IIAhTaRozhGjx5_E4Ggwb4ncVRw==
+Request ID: w7GEZUzh3QQpHzJbb7WSARYQrKUKbAwzc1kmLM_yopTp4nAaN1Zofw==
 </PRE>
 <ADDRESS>
 </ADDRESS>
```

@bassosimone

Conclusions

We should honour the 301 and 308 redirects: they are a legitimate reason for updating the test list. If we implement this very reasonable change, we're left with ~12k HTTP URLs in the test list.

We can divide those ~12k URLs in two sets: those for which we can establish a successful HTTPS connection with the same website and those for which we cannot. The former set consists of around 5.5k URLs.

If we use the tlsh difference between HTTP and HTTPS as a metric and we define equality in a very straightforward way, i.e., when such difference is zero, then we can safely update to HTTPS 50% of the remaining 5.5k URLs.

If we additionally pre-filter the content of webpages using bluemonday to exclude all HTML tags and only keep the textual content, then the number of equal webpages becomes 80%.

So, we can convert between 2.5k and 4.4k URLs to HTTPS, and we end up with between 7.6k and 9.0k URLs that are HTTP only. Some of these [7.6k, 9.0k] URLs may yield errors. The 12k - 5.5k URLs that we could not convert also include URLs where we could not even access the HTTP website because of errors such as NXDOMAIN.

When redesigning Web Connectivity, one problem we want to solve is that the probe may not be able to discover all the URLs to test because of censorship. For example, say for some reason the test list is not up-to-date and http://example.com is a 301 redirect to https://example.com. However, say http://example.com is censored with a connection reset when the censorship equipment spots the example.com host header. In such a case, the probe cannot discover that it also needs to test https://example.com without help from the test helper.

A more clever test helper could really help us compensate for an inaccurate test list. For example, given the https://example.org input, it may be tempting to also check http://example.org. This seems an extra argument in favour of migrating (and, confusingly, also for not migrating!) as many URLs as possible to HTTPS.

The only big conceptual issue that I continue to see, if we go down this road of improving Web Connectivity, is that a single measurement will increasingly be a collection of all the tests performed starting from the input URL and all the possibly relevant resources associated with it. That is, a user asking in which cases https://www.example.com is SNI blocked (or TLS blocked, or just blocked) may fail to find the desired answer when searching for the input field equal to https://example.com, because they may miss crucial http://example.com measurements. (This is probably the reason why the search input box in Explorer suggests entering a domain rather than a full URL.)

@bassosimone

I completed this activity in the previous Sprint, so I am going to put it back into the previous Sprint and close it. I didn't close it until now because I wanted to first discuss the results contained in this issue with @hellais and @agrabeli.

The follow-up issue is this epic: #1745.
