Signal Searcher attempts to find interesting and important signals in MLab's Internet measurement data using computers instead of people.
Measurement Lab is collecting lots and lots of measurement data from all over the Internet. The project is drowning in data, and too frequently has to work too hard to acquire any of data's derivatives: information, knowledge, and wisdom. In the past, issues were found because there were operator rumors of bad behavior. Depending on intensive investigations by humans spurred by rumor-mill dross is not a sustainable solution. As a process it is too random to get started reliably, and for its execution it depends on a very scarce resource (researcher time).
Signal Searcher (this repo) is an attempt to create an automated solution that will permit the display and discovery of issues relating to end-to-end Internet performance and aid in Internet performance transparency. Once performance and measurement has been made transparent, network management can proceed from a firm footing based on open and auditable data. Signal searcher tries to help bootstrap this process by automatically mining the data.
Let's break down Signal Searcher's mission for our data:
Signal Searcher attempts to find interesting and important signals in MLab's Internet measurement data using computers instead of people.
A signal is interesting if:
- It has patterns that indicate overload
- It gets worse over time
- It changes differently than its cohort
- It is out of control
- It is surprising
A signal is important if:
- It corresponds to an attribute that directly affects a user's Internet experience, e.g. latency, download throughput, and upload throughput.
and either
- It has a large effect on part of the Internet OR
- It affects a large part of the Internet
Each signal is a time-series of measured values. Signals may be (and almost certainly are) aggregates of time-series data. Let us imagine each value from a test as a single row in a table of measured values. Each value has lots of attributes for the test environment and for the measured values that were recorded. In data mining there is the concept of a data cube: a hypercube where each dimension corresponds to a piece of the key specifying a data value. To create a signal, you must specify an operation for every key dimension in the data cube. That is, for every dimension you must "transform" (apply an operation to each data value that e.g. aligns timestamps, transforms IPs to ASNs, or buckets data), "roll up" (combine all data along the given dimension using some operation like "average" or "median"), or "drill down" (specify a particular value or range of values for that particular dimension). For metadata dimensions (dimensions that are neither part of the key or the value under test), there is another operator: "drop", which means to ignore that dimension entirely. Once every dimension has an operation specified, the result will be a table of values. If the operations are specified properly for our purposes, then the result will be a signal: a one-dimensional sequence of values.
Our data is network measurement data, collected at the M-Lab measurement points across the Internet, and stored in BigQuery. The two main categories of data are measurements of performance and measurements of topology. Each test is associated with a litany of metadata (although we would like more metadata in BigQuery) regarding the environment in which the test was conducted. We will need to be able to join topology and performance metrics if we would like to perform any sort of tomography about where in the network any observed problems might lie. Because each performance data test is followed by a traceroute, the topology data can be considered metadata of the performance data. However, the topology data is also interesting in its own right and may also be studied independently of the performance data.
For our first version, we will restrict ourselves to studying just sources and destinations of traffic, leaving more complex topology-sensitive analyses for the future. In the first version we will also restrict ourselves to just the three main measurements of interest for end-to-end user experience on the Internet: Download speed, upload speed, and RTT.
A sample query to get this out of MLab's BigQuery tables might be:
SELECT
web100_log_entry.connection_spec.remote_ip AS server_ip_v4,
web100_log_entry.connection_spec.remote_ip AS client_ip_v4,
8 * (web100_log_entry.snap.HCThruOctetsAcked /
(web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +
web100_log_entry.snap.SndLimTimeSnd)) AS download_mbps,
(web100_log_entry.snap.SumRTT /
web100_log_entry.snap.CountRTT) AS avg_rtt,
web100_log_entry.snap.MinRTT AS min_rtt,
(web100_log_entry.snap.SegsRetrans /
web100_log_entry.snap.DataSegsOut) AS packet_retransmit_rate
FROM
[plx.google:m_lab.ndt.all]
WHERE
web100_log_entry.snap.SndLimTimeSnd IS NOT NULL
AND web100_log_entry.snap.SndLimTimeCwnd IS NOT NULL
AND web100_log_entry.snap.SndLimTimeRwin IS NOT NULL
AND project = 0
AND web100_log_entry.is_last_entry = True
AND connection_spec.data_direction = 1
AND web100_log_entry.snap.CongSignals > 0
AND web100_log_entry.snap.HCThruOctetsAcked >= 8192
AND (web100_log_entry.snap.State == 1
OR (web100_log_entry.snap.State >= 5
AND web100_log_entry.snap.State <= 11))
AND (web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +
web100_log_entry.snap.SndLimTimeSnd) >= 9000000
AND (web100_log_entry.snap.SndLimTimeRwin +
web100_log_entry.snap.SndLimTimeCwnd +
web100_log_entry.snap.SndLimTimeSnd) < 3600000000
AND web100_log_entry.snap.CountRTT > 10
We specify a signal by a series of transforms, roll-ups, and drill downs. So the space we are searching for interesting signals is the set of all possible transforms, roll-ups, and drill-downs applied to each dimension in our data set. The size of this search space grows exponentially with the number of dimensions a measurement has ("the curse of dimensionality") and multiplicatively with the number of values in each dimension. Even with computational help, this is too large a space to use brute force. Instead, we will specify roll-ups and drill-downs of interest and then use the derived time series data.
To cover the project's remit of general search, we will train an ML model on our data and then run any new data through the model to see if the new results are mispredicted (and therefore "surprising", which can be thought of as "potentially interesting").
We use this insight to build a system which goes through our database looking for data streams of interest that violate at least one of these principles. Because of our focus on end-to-end user experience, we restrict ourselves to searching for signals in metrics which affect the end-user experience: Download speed, upload speed, and RTT.
Expanding on the initial super-generic statement, we come up with the following assertions about the properties a healthy and open Internet should have:
-
Performance for a given ISP or ISP pair should not vary diurnally. Diurnal performance variation was discovered to be a leading indicator of network congestion problems.
-
Performance should not get worse over time. If we ever find performance to be degrading, then that is an interesting signal that we should investigate.
-
For a given access ISP, performance should be the same to each content ISP in a given metro. If, for access ISP X and content ISPs Y and Z, the expected (X,Y) performance is statistically significantly different from the (X,Z) performance, then we have discovered an Internet bottleneck outside of the last mile in X!
-
Performance measurements should be under statistical control. If the variation in expected performance suddenly spikes, then that is an important signal worth investigating.
-
If we have a trained model of past performance, then it should be able to predict current performance with high accuracy. If the predictions of our model are very wrong, then it is likely that we are seeing something new happening on the Internet.
In all of the above, we use the word "performance" and we don't say whether it should go up or down, because RTT should go down, and bandwidth should go up, but either way we can still say that performance should not get worse.
Signal searcher will operate in batch mode. The time period between batch runs will be determined by the runtime of the system - if the batch job takes days to complete, then it should be run less frequently than if it only takes minutes. As command line parameters, signal searcher will take in the IP ranges of interest and the time period of interest.
The output of signal searcher will be a performance report describing any observed problems in that specified subset of the data. The report will clearly have to go through multiple UX iterations, but it will initially be designed to output graphs similar to those in the M-Lab interconnection report.
In the explanations below, whenever we specify a time-series, what we mean is the raw data, bucketed appropriately, with the median operator applied to each bucket.
One of the strongest findings in the report was that diurnal performance was a strong indicator of network overload. Therefore, our first metric of interest will be to load the time-varying signal of a single ISP's performance and to analyze to see if it has a strong 24-hour cycle.
We will do this by performing a Fourier transform of each time series of performance and then squaring the absolute value of the result to get a power curve. We then look in this curve to see if it has a bump at the peak corresponding to a 24 hour frequency.
Performance should not get worse over time. If we discover an instance where performance does seem to get worse over time, then we should investigate.
If we discover that the same access ISP has significantly different performance to different content ISPs within a metro area, then that is strong evidence that in at least some instances the user's experienced performance is being affected by a bottleneck that is not part of their last mile.
We can test equivalence using the Kolmogorov-Smirnov test which will estimate how different two distributions might be. For the time period of interest, we query to find the distribution of our data and then use the K-S test to determine how likely it is that the two distributions are the same.
Use process control theory to evaluate whether the fluctuations of our time series represent standard levels of randomness around the same statistical mean or whether they represent the system transitioning to a new, different, "out of control" state.
This depends on a currently-unpublished paper, and so is blocked until that work is published.
The state of the art machine learning system is Google's TensorFlow. So, we should build a TensorFlow system, train and validate it on our data according to standard best practices, and then ensure that our data set of interest does not hold any surprises for the trained system. In this context, surprises for our ML system would imply that some aspect of the data has changed in an important way.
(N.B. TensorFlow is only available in C++ or Python flavors, so this means that the project as a whole needs to be written in C++ or written in Python or to swim upstream. Python it is.)
The resulting reports from Signal Searcher should be emitted in priority order. An item of interest can be of high priority because it affects a large portion of the Internet, or because it strongly effects a part of the Internet. An item which does both would then be of the highest priority. The relationship between size and severity in determining priority can only be determined empirically, so we leave the exact ranking function unspecified until we have several unordered reports to provide us a basis from which to reason.
The format of the reports is unspecified, but should at a minimum contain graphs of the signals we claim are interesting and an indication of why they are interesting.
We will begin by using BigQuery resources wastefully. For every site, write a query to get every NDT result to that site. Then, run the resulting time series through each of: The 24-hour tests, the non-degradation test, and the statistical control test.
Once those three tests are implemented, we should look for equivalent performance in a metro area by using the GeoIP of each test as well as the origin AS for each test. Finally, we will train a TensorFlow model to predict performance and then run each new measurement through that same model.
The suggested implementation order is:
- First phase: reproduce MLab Interconnection report analyses
- Querying to get a single time series. (half done)
- 24-hour cycles
- Initial output reports
- Second phase: reproduce the more-understandable (but less sensitive) analyses of performance degradation
- Performance non-degradation
- Ensure Statistical control
- Improved output reports
- Third phase: end-to-end verification of neutrality of interconnections within a metro
- Querying to get a performance distribution
- Equivalent performance within a metro
- Fourth phase: using machine learning to find surprises
- Querying to get everything
- TensorFlow prediction model
- Cleaned up output reports
Note that once querying to get a time-series works, implementing the time-series tests can proceed in parallel. Also, output reports should be constructed immediately once we have a single test working, and then should be refactored once we have multiple tests working, and finally should be cleaned up and made into something we are proud of once all tests are working.
-
Apply the tools of granger causation in an attempt to discover what patterns end up being leading indicators for these metrics.
-
Online signal searcher, which informs of Internet measurement changes in real-time by evaluating each incoming measurement and deciding whether or not to fire an alert.
-
See what other metrics in our data set may be informative. Web100 has a lot of variables that are largely unexamined; maybe one of them holds the key to understanding Internet performance in the large.
-
Use paris-traceroute path information to build up more in-depth analyses which attempt to localize performance problems in the network. In particular, each measurement can be read as an attestation of the minimum performance available to each link on the path. By combining these measurements we should be able to discover what links in poorly-performing paths do not have another measurement attesting to their quality.