test lists: investigate errors and redirections #1727
I created the bassosimone/gardener repository to collect scripts useful to check every entry inside the test list. As of bassosimone/test-lists-gardener@c9892fb, this repository only contains code for measuring every entry inside the test list and producing a small report. I am using custom uncommitted scripts for performing data analysis.
Classification of errors

I ran tests on a GCE box located in the Amsterdam area. The number of entries in the test list that failed is surprisingly large compared to the total number of entries (I expected to see fewer errors).
I classified errors (which were raw Go errors) using rules compatible with what OONI Probe does.
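As a rough illustration (not the actual gardener code), a classifier along these lines maps raw Go error strings to OONI-style failure names; the specific matching rules below are assumptions for the sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// classifyError maps a raw Go error message to an OONI-style failure
// string. The patterns below are illustrative, not the exact rules
// used by the gardener scripts.
func classifyError(err error) string {
	if err == nil {
		return ""
	}
	s := err.Error()
	switch {
	case strings.Contains(s, "no such host"):
		return "dns_nxdomain_error"
	case strings.Contains(s, "connection refused"):
		return "connection_refused"
	case strings.Contains(s, "context deadline exceeded"),
		strings.Contains(s, "i/o timeout"):
		return "generic_timeout_error"
	case strings.Contains(s, "certificate"):
		return "ssl_invalid_certificate"
	case strings.Contains(s, "EOF"):
		return "eof_error"
	default:
		return "unknown_failure"
	}
}

func main() {
	fmt.Println(classifyError(fmt.Errorf("dial tcp: connection refused")))
}
```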
Length of the redirect chains

In case of failure, the redirect chain contains zero elements. If there is no redirection, it contains a single element. Otherwise, it contains two or more elements. The number of redirections is therefore the length of the chain minus one.
I was surprised to see that in most cases, the redirect chain is short. I expected to see longer chains.
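For reference, a minimal way to record such a redirect chain in Go (a sketch, not the gardener implementation) is to hook the client's CheckRedirect callback:

```go
package main

import (
	"fmt"
	"net/http"
)

// measureChain returns the redirect chain for the given URL: empty on
// failure, one element if there is no redirection, two or more otherwise.
func measureChain(rawURL string) []string {
	chain := []string{}
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// via contains all previous requests; record them once we
			// know a redirection happened.
			if len(chain) == 0 {
				for _, prev := range via {
					chain = append(chain, prev.URL.String())
				}
			}
			chain = append(chain, req.URL.String())
			return nil // keep following redirects (up to the default limit)
		},
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return nil // failure: zero elements
	}
	defer resp.Body.Close()
	if len(chain) == 0 {
		chain = []string{rawURL} // no redirection: a single element
	}
	return chain
}

func main() {
	chain := measureChain("http://example.com/")
	fmt.Printf("chain length: %d, redirections: %d\n", len(chain), len(chain)-1)
}
```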
Status code for redirects

We only consider redirect chains not shorter than two elements. We classify the HTTP status code:
Considering that 301 and 308 are permanent redirects and 302 and 307 are temporary redirects, if we want to trust the semantics applied to status codes, then in most cases we should probably use the new URL directly. Here is the same table when only considering chains exactly equal to two elements:
We can conclude that in 81% of the cases the real URL to measure is probably just one redirect away.
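A minimal helper capturing that permanent-versus-temporary distinction (a sketch; the actual classification lives in the gardener scripts):

```go
package main

import (
	"fmt"
	"net/http"
)

// isPermanentRedirect reports whether a redirect status code promises
// that the resource has moved permanently, in which case it seems
// reasonable to update the test list entry to the new URL.
func isPermanentRedirect(status int) bool {
	return status == http.StatusMovedPermanently || // 301
		status == http.StatusPermanentRedirect // 308
}

func main() {
	for _, code := range []int{301, 302, 307, 308} {
		fmt.Println(code, "permanent:", isPermanentRedirect(code))
	}
}
```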
Changes in the first redirection

We start by checking what changes between the test list URL and the first redirection.
Note that any input URL may fall into one or more of these categories. For this reason, we cannot sum the percentages: we are not counting disjoint events.
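To make the categories concrete, here is a sketch (with illustrative category names, not necessarily those used in the report) of how one could compare the test list URL with the first redirection:

```go
package main

import (
	"fmt"
	"net/url"
)

// changes returns the (non-disjoint) set of differences between the
// test-list URL and the URL of the first redirection.
func changes(inputURL, redirectURL string) []string {
	in, err1 := url.Parse(inputURL)
	out, err2 := url.Parse(redirectURL)
	if err1 != nil || err2 != nil {
		return []string{"parse_error"}
	}
	var cats []string
	if in.Scheme != out.Scheme {
		cats = append(cats, "scheme_changed") // e.g. http => https
	}
	if in.Hostname() != out.Hostname() {
		cats = append(cats, "host_changed") // e.g. example.com => www.example.com
	}
	if in.Path != out.Path {
		cats = append(cats, "path_changed")
	}
	if in.RawQuery != out.RawQuery {
		cats = append(cats, "query_changed")
	}
	return cats
}

func main() {
	fmt.Println(changes("http://example.com/", "https://www.example.com/home"))
}
```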
Changes in subsequent redirections

We analyze the changes between the URL of the first redirection and the final URL. (That is, this table shows what additional changes we see after the first redirection.)
Note that any input URL may fall into one or more of these categories. For this reason, we cannot sum the percentages: we are not counting disjoint events.
We should repeat the experiment in the network in which we run test helpers (Greenhost or DigitalOcean) to check whether we have fewer errors. We should also save the length in bytes of each webpage.
I repeated the experiment from a DigitalOcean droplet using bassosimone/test-lists-gardener@8cf3766. Here are the new results.
Classification of errors

Here's what we measured from DigitalOcean:
TBD: compare with the previous measurement and write a comment. Note: in GCE I also mistakenly tested the headers of the CSV, resulting in […]. It's also interesting to see that […].
Length of the redirect chains
At first glance, the results are about the same as from GCE.
Changes in the first redirection
Here the results are in line with before.
Status code of the first redirection

We observe the "first" request in a chain and record its status code.
It's quite similar to before.
Size of webpages

This is a new metric: we take the last response in the redirection chain and build the distribution of the body size. The current implementation truncates bodies larger than a fixed threshold.

The following plot, instead, shows the cumulative sizes before fetching the final pages. That is, this is the number of bytes we download when following all the redirections. In most cases, redirections have an empty body or a very small body.
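A minimal way to implement such a cap in Go (the actual threshold used by the gardener is not shown above, so the 1 MiB below is just an assumption):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// maxBodySize is an assumed truncation threshold (1 MiB); the real
// gardener code may use a different value.
const maxBodySize = 1 << 20

// fetchBodySize downloads at most maxBodySize bytes of the response
// body and returns how many bytes we actually read.
func fetchBodySize(rawURL string) (int64, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// io.LimitReader stops reading once the cap is reached, which
	// truncates larger bodies without buffering them fully.
	return io.Copy(io.Discard, io.LimitReader(resp.Body, maxBodySize))
}

func main() {
	n, err := fetchBodySize("https://example.com/")
	fmt.Println(n, err)
}
```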
Classification of input URLs

Let's just classify the input URLs in the test list:
It's interesting to see how many plain-HTTP URLs there are.
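A sketch of this kind of classification by URL scheme (the test lists are CSV files; reading one URL per line from stdin is a simplification for the example):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/url"
	"os"
)

// classifyScheme tallies how many test-list URLs use http, https, or
// something else, reading one URL per line.
func classifyScheme(r io.Reader) map[string]int {
	counts := map[string]int{}
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		u, err := url.Parse(scanner.Text())
		if err != nil {
			counts["unparseable"]++
			continue
		}
		counts[u.Scheme]++
	}
	return counts
}

func main() {
	fmt.Println(classifyScheme(os.Stdin))
}
```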
We need to think about these topics. They may be quite useful for understanding how to update the test lists.
Honour 30{1,8} and classify again

Let's do the easiest change. Let's do what a search engine would do and automatically update all URLs in the test list by following their 301 and 308 permanent redirects.
After this change, we are still left with a significant fraction of HTTP websites. I think it's completely reasonable to update all URLs in the test list for which we have a permanent (301 or 308) redirect.
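A sketch of the "honour 301/308" rewrite (not the actual gardener code): follow redirects only while they are permanent, and stop at the first non-permanent hop.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// honourPermanentRedirects follows 301/308 redirects starting from the
// test-list URL and returns the last URL reached through permanent
// redirects only.
func honourPermanentRedirects(rawURL string) (string, error) {
	current := rawURL
	client := &http.Client{
		// Do not follow redirects automatically: we want to inspect
		// each status code ourselves.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	for i := 0; i < 10; i++ { // arbitrary safety limit
		resp, err := client.Get(current)
		if err != nil {
			return current, err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusMovedPermanently &&
			resp.StatusCode != http.StatusPermanentRedirect {
			return current, nil // not a permanent redirect: keep this URL
		}
		loc, err := resp.Location()
		if err != nil {
			return current, err
		}
		current = loc.String()
	}
	return current, errors.New("too many permanent redirects")
}

func main() {
	fmt.Println(honourPermanentRedirects("http://example.com/"))
}
```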
Investigating the remaining HTTP URLs

After the previous analysis, there are still 12088 websites in the test list whose URL is HTTP and that do not provide a permanent redirect to HTTPS.

The criterion to determine whether they support HTTPS is the following. First, we must be able to establish a TLS connection with the server. Second, the webpage obtained using HTTPS is between 0.7 and 1/0.7 of the size of the webpage obtained using HTTP. The latter is a very simplistic check (see the sketch below). For more accuracy, we could have used ssdeep or fuzzywuzzy.

The result of this investigation is that 5475 out of 12088 websites support HTTPS even though they do not provide a redirect to us using 301 or 308. The general rule for compiling the test lists that I am aware of is the following: if a website supports both, we should give preference to HTTPS in the test lists. Therefore, if we were to follow this rule, we could have just 6613 HTTP websites in the test lists. That number would roughly correspond to 19% of the URLs.

Another (perhaps orthogonal) possibility worth exploring is having the new web connectivity test helper check whether the website supports HTTP and HTTPS. The algorithm could roughly be as follows:
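A rough sketch of the HTTPS-support check described above (the size-ratio bounds come from the comment; the function names and everything else are assumptions):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// supportsHTTPS applies the two criteria from the comment above:
// (1) we can establish a TLS connection with the server, and
// (2) the HTTPS body size is between 0.7 and 1/0.7 times the HTTP one.
func supportsHTTPS(httpURL string) (bool, error) {
	u, err := url.Parse(httpURL)
	if err != nil {
		return false, err
	}
	// Criterion 1: TLS handshake with the same host on port 443.
	conn, err := tls.Dial("tcp", u.Hostname()+":443", &tls.Config{ServerName: u.Hostname()})
	if err != nil {
		return false, nil // no TLS support
	}
	conn.Close()
	// Criterion 2: compare body sizes over HTTP and HTTPS.
	httpSize, err := bodySize(httpURL)
	if err != nil {
		return false, err
	}
	httpsURL := *u
	httpsURL.Scheme = "https"
	httpsSize, err := bodySize(httpsURL.String())
	if err != nil {
		return false, err
	}
	if httpSize == 0 {
		return httpsSize == 0, nil
	}
	ratio := float64(httpsSize) / float64(httpSize)
	return ratio >= 0.7 && ratio <= 1/0.7, nil
}

// bodySize downloads the body and returns its size in bytes.
func bodySize(rawURL string) (int64, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.Copy(io.Discard, resp.Body)
}

func main() {
	fmt.Println(supportsHTTPS("http://example.com/"))
}
```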
The bottom line of this reasoning seems to be the following. If we accept the notion that the test helper navigates redirections and instructs the client about which URLs and endpoints to test (including whether to use QUIC), then our job becomes significantly simpler because:
So, it seems the question now becomes whether we can write such a test helper and what issues a test helper working like this could cause us.
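Purely as a thought experiment (this is not an existing OONI API), the control response of such a test helper might look roughly like this, with the helper expanding the input URL into concrete URLs and endpoints for the probe to measure:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types are hypothetical: they sketch what a redirect-aware test
// helper could return, not what the web connectivity protocol defines.

// EndpointPlan tells the probe how to measure one endpoint.
type EndpointPlan struct {
	Address string `json:"address"` // e.g. "93.184.216.34:443"
	UseQUIC bool   `json:"use_quic"`
}

// ControlResponse lists the URLs and endpoints the probe should test,
// as discovered by the helper while following redirections.
type ControlResponse struct {
	URLs      []string       `json:"urls"`
	Endpoints []EndpointPlan `json:"endpoints"`
}

func main() {
	resp := ControlResponse{
		URLs: []string{"http://example.com/", "https://www.example.com/"},
		Endpoints: []EndpointPlan{
			{Address: "93.184.216.34:443", UseQUIC: true},
		},
	}
	data, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(data))
}
```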
More in-depth study of bodies using ssdeep

For the 5475 HTTP URLs that support HTTPS but do not redirect, let's plot the difference between the length of the HTTP page and the length of the HTTPS page. This tells us how often the two pages have the ~same length. It turns out that in most cases the two pages have the ~same length.

We can now try to use ssdeep on the web pages that really have a comparable length. Let us now plot the ssdeep score (a number between 0 and 100, where 100 means perfect matching) versus the length of the HTTP page. The -10, -20, -30, and -40 scores indicate cases where it was not possible to compute the ssdeep score: -10 is when […]. We cannot feed to […].

As before, we have special negative scores (20% of the pages). For the other scores, we see that in many cases the score is quite far from 100. What is going to be fun is trying to understand what that means. So, I took a random page with score […].
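To reproduce this kind of comparison in Go, one option is the github.com/glaslos/ssdeep package (assuming its FuzzyBytes/Distance API; the analysis scripts may use different tooling). A sketch:

```go
package main

import (
	"fmt"

	"github.com/glaslos/ssdeep"
)

// ssdeepScore returns the ssdeep similarity score (0..100) between two
// page bodies, or an error when the hash cannot be computed (for
// example, when a body is too small for ssdeep).
func ssdeepScore(httpBody, httpsBody []byte) (int, error) {
	h1, err := ssdeep.FuzzyBytes(httpBody)
	if err != nil {
		return 0, err
	}
	h2, err := ssdeep.FuzzyBytes(httpsBody)
	if err != nil {
		return 0, err
	}
	return ssdeep.Distance(h1, h2)
}

func main() {
	// In the real analysis the bodies come from the cache of fetched
	// pages; here we just use placeholder byte slices.
	score, err := ssdeepScore([]byte("<html>...</html>"), []byte("<html>...</html>"))
	fmt.Println(score, err)
}
```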
Same as above but using TLSH

Let us use the TrendMicro Locality Sensitive Hash fuzzy hashing function (aka TLSH). A quick skim of the associated paper indicates that this hashing scheme has some advantages over ssdeep, particularly because its similarity output is not limited to the integer range between 0 and 100, and (IIUC) because it degrades more smoothly. The output range for TLSH comparison is between […].

The paper introducing TLSH includes in Section IV a table (Table II) that indicates (IIUC) the precision and recall of TLSH compared with SSDEEP and other fuzzy hashing techniques. The sample they used to generate the table was small because they needed to compare files manually. From that table, it seems that we have a false positive rate of ~7% and a detection rate of ~94% if we keep the TLSH score strictly lower than 100. I'm not sure whether this result can be generalized so easily, but it is probably useful to start thinking about what the score above means.

Let us also add a pre-processing step. Under the assumption that the input is HTML (which can be verified by checking the […]), […]. So, because […].

Another aspect that it may be interesting to look into is trying to figure out whether files whose […].

Checking high difference scores with bluemonday preprocessing

So, I checked one of the pages with a difference score of […]. Then I looked at pages with a difference score of […]. Here's another random sampling check: […]. Here's instead a case with difference score […]:

```diff
diff -u cache/aHR0cDovL2Jsb2dzLndzai5jb20vaW5kaWFyZWFsdGltZS90YWcvYXJpbmRhbS1jaGF1ZGh1cmk= cache/aHR0cHM6Ly9ibG9ncy53c2ouY29tL2luZGlhcmVhbHRpbWUvdGFnL2FyaW5kYW0tY2hhdWRodXJp
--- cache/aHR0cDovL2Jsb2dzLndzai5jb20vaW5kaWFyZWFsdGltZS90YWcvYXJpbmRhbS1jaGF1ZGh1cmk=	2021-08-04 19:42:57.000000000 +0200
+++ cache/aHR0cHM6Ly9ibG9ncy53c2ouY29tL2luZGlhcmVhbHRpbWUvdGFnL2FyaW5kYW0tY2hhdWRodXJp	2021-08-04 19:42:57.000000000 +0200
@@ -13,7 +13,7 @@
 <HR noshade size="1px">
 <PRE>
 Generated by cloudfront (CloudFront)
-Request ID: WAWr0DG4pZh2v5UauauqZq1bht8IIAhTaRozhGjx5_E4Ggwb4ncVRw==
+Request ID: w7GEZUzh3QQpHzJbb7WSARYQrKUKbAwzc1kmLM_yopTp4nAaN1Zofw==
 </PRE>
 <ADDRESS>
 </ADDRESS>
```
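For the bluemonday pre-processing step, a sketch along these lines seems plausible (using bluemonday's StrictPolicy, which strips all HTML tags and keeps only the textual content); the actual pipeline may differ:

```go
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

// stripHTML removes all HTML tags and keeps only the textual content,
// so that fuzzy hashing (ssdeep or TLSH) compares the text of the two
// pages rather than their markup.
func stripHTML(body []byte) []byte {
	// StrictPolicy strips every tag and attribute.
	return bluemonday.StrictPolicy().SanitizeBytes(body)
}

func main() {
	httpBody := []byte("<html><body><h1>Hello</h1><p>world</p></body></html>")
	fmt.Printf("%s\n", stripHTML(httpBody))
	// The sanitized bytes would then be fed to the fuzzy hashing step.
}
```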
Conclusions

We should honour the 301/308 permanent redirects. We can divide the ~12k remaining HTTP URLs into two sets: those for which we can establish a successful HTTPS connection with the same website and those for which we cannot. The former set consists of around 5.5k URLs. If we use the […], […]. If we additionally pre-filter the content of webpages using bluemonday to exclude all HTML tags and only keep the textual content, then the number of equal webpages becomes 80%. So, we can convert between 2.5k and 4.4k URLs to HTTPS and we end up with between 7.6k and 9.0k URLs that are HTTP only. Some of these [7.6k, 9.0k] URLs may yield errors. The 12k - 5.5k URLs that we could not convert also include URLs where we could not even access the HTTP website because of errors such as NXDOMAIN.

When redesigning Web Connectivity, one problem we want to solve is that the probe may not be able to discover all the URLs to test because of censorship. For example, say for some reason the test list is not up-to-date and […]. A more clever test helper could really help us to compensate for an inaccurate test list. For example, given the […].

The only big conceptual issue that I continue to see, if we go down this road of improving Web Connectivity, is that a single measurement will increasingly be a collection of all the tests performed starting from the input URL and all the possibly relevant resources associated with it. That is, a user asking the question of all the cases in which […].
This issue aims to perform an initial reconnaissance to understand which URLs in the test list trigger errors from uncensored locations, as well as the average number of redirects in the test list.
This activity is instrumental to the webconnectivity redesign (#1714).