Document process for new custom metrics #138
Thanks so much for opening this @rviscomi. I didn't realize these were a thing (the fact that they live in 'legacy' made me not pay attention to them) when I asked you about this, but yes, this is very close to what I was asking about, for all of the reasons you mention.

I don't want to distract from the primary question you're asking, but it's not immediately clear to me how to access or analyze the data these collect. It's hard to contribute usefully to the discussion without that, so apologies if this is noise or distraction; feel free to ignore here and reply privately if that's better.

For example, if I had to guess from this, it seems like each metric records a single value and is responsible for everything itself, so each one does its own full tree scan. Does that become a concern if we add too many, or is it simply about storage size, since summarizing duplicates values in a sense? Or is it about what you can do with the data once you have it?

I see that they vary a bit: they all need to return a single value, but some, like the third-party libraries one, seem to return complex, stringified data in that value. That can certainly be used to sum data up into more manageable chunks, as long as it doesn't get too complex, I guess.

The things I'm interested in (tags/attributes) could be reasonably summarized into a similarly stringified value via something like this function, which summarizes the WHATWG HTML single-page edition (currently about 10.4 MB of HTML markup) into about 6.3 KB of summary data about all of the tags and attributes. But then how would you analyze that? Regexps again? Or somehow expand it into wide columns? If the dataset is small enough that one can reasonably process it programmatically, maybe that's OK?
Custom metrics are included in the JSON-encoded HAR payloads in the pages dataset. Here's an example of querying the num_scripts_async custom metric:

```sql
SELECT
  APPROX_QUANTILES(JSON_EXTRACT_SCALAR(payload, '$._num_scripts_async'), 1000)[OFFSET(500)] AS median_async_scripts
FROM
  `httparchive.pages.2019_03_01_desktop`
```

It's a bit more complicated to unpack JSON-encoded custom metrics like third-parties, but still doable.
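For instance, a metric that stores a JSON-encoded array can be unpacked with BigQuery's JSON functions. Here's a minimal sketch, assuming a hypothetical `_third_party_libs` field holding a JSON-encoded array of objects with a `name` property (the real third-parties metric's name and shape may differ):

```sql
-- Sketch: unpacking a hypothetical JSON-encoded custom metric.
-- The inner JSON_EXTRACT_SCALAR pulls the metric's string value out of
-- the HAR payload; JSON_EXTRACT_ARRAY then parses that string, since
-- complex values are stored double-encoded.
SELECT
  JSON_EXTRACT_SCALAR(lib, '$.name') AS library,
  COUNT(DISTINCT url) AS pages
FROM
  `httparchive.pages.2019_03_01_desktop`,
  UNNEST(JSON_EXTRACT_ARRAY(JSON_EXTRACT_SCALAR(payload, '$._third_party_libs'))) AS lib
GROUP BY
  library
ORDER BY
  pages DESC
```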
Yes, each custom metric returns a single scalar value like an int, boolean, or string. Complex data types like arrays and objects must be JSON-encoded. (@pmeenan sanity check: are complex types compatible with the WPT reporting?) I do think there's a concern about having too many custom metrics, or metrics that are computationally expensive. Given that these metrics run on every desktop and mobile test, millions of times per month, every second of testing time counts.
6 KB of JSON isn't terrible for a single page, but I'm curious to see what the aggregate effect is over all ~4M tests for both desktop and mobile. Assuming that's the average case, the aggregate works out to roughly 6 KB × 4M ≈ 24 GB per monthly crawl. Analyzing that JSON data would be done with the same JSON extraction functions shown above. The markup analysis is a good case study for the purposes of this issue, and a good place to figure out the answers to these questions.
JSON is itself not especially efficient here. It's not terrible, but it's certainly plausible to save bytes by doing something else that's generally almost as simple to use (simple key/value pairs should more or less work here), but I guess that's a non-option really? I wouldn't suggest that the HTML standard example is representative; there are very small pages that would generate similar-sized or even larger summaries, and vice versa. It's just that the summary seems to inherently have a (comparatively) pretty small upper bound while providing a lot of data.

It seems that there is already collection happening for attributes, and having that context seems helpful for a number of questions we try to answer; maybe it even makes some of the questions we have just a summary of this summary and less immediately useful? But yes, this would get considerably smaller if it were just tag names and their counts, and that alone would be a big step up, since it catches anything that parses into a tag, which you can then look at in a lot of ways.

I would very much like to keep it, or at least include it periodically? Having some grip on this data (even over time) seems important for too many discussions.
For the markup metric let's start with JSON for simplicity, and we can revisit the format if we need to save on bytes. I agree that kv pairs of tag names and frequency counts are a good start.
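As a sketch of what that analysis could look like, assuming a hypothetical `_element_count` metric that JSON-encodes a map of tag names to per-page counts (the field name and shape are illustrative, not an existing metric):

```sql
-- Sketch: median number of <div> elements per page, assuming a
-- hypothetical `_element_count` metric whose value is a JSON-encoded
-- map like {"div": 120, "span": 85, ...}.
SELECT
  APPROX_QUANTILES(
    CAST(JSON_EXTRACT_SCALAR(
      JSON_EXTRACT_SCALAR(payload, '$._element_count'),
      '$.div') AS INT64),
    1000)[OFFSET(500)] AS median_divs_per_page
FROM
  `httparchive.pages.2019_03_01_desktop`
```

One caveat: per-tag lookups like this need the tag name spelled out in the JSON path; enumerating arbitrary keys from a JSON string in BigQuery generally requires a UDF.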
@igrigorik @pmeenan @paulcalvano I'm looking at the initial list of metrics brainstormed for the Almanac, and for many of them I think custom metrics would be useful, if not the only practical way. The questions raised in this issue become more important if we're going to consider adding ~dozens of new custom metrics. The high-level question to answer is: how much can we realistically do using custom metrics without affecting the crawl? One use case that comes to mind is the word count per page listed in the SEO chapter. There are many other metrics with complicated implementations in the rest of the chapters. (@HTTPArchive/data-analysts FYI)
Can you combine a bunch of the checks into a single collection script so we're not running dozens of scripts? It can return JSON and it should be passed through. I'd be surprised if even dozens of metrics added more than a few seconds to the crawl, so it should be a non-issue, but be mindful of any that look particularly expensive (exponential based on the size of the DOM, for example).
Custom metrics are a powerful way to extract insights from pages directly from the JS runtime context. This is an especially useful point in the test to analyze things that depend on the DOM, like counting HTML element and attribute usage. Those things can be approximated by running regular expressions over the response bodies, but that becomes extremely costly considering that the latest desktop and mobile response_bodies tables are over 11 TB combined (a sketch of that approach follows the questions below).

I'd like to use this issue to discuss what the criteria should be for custom metrics:

- What are the limitations of doing more work in custom metrics?
- Should we be actively cleaning up unused custom metrics?
- Should we have a policy for allowing one-time custom metrics by request?
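To make the cost comparison concrete, here's a sketch of the regex-over-response-bodies approach that a custom metric would replace, assuming the legacy `response_bodies` schema with `page` and `body` columns (the regex is illustrative):

```sql
-- Sketch: approximating async script usage by scanning raw response
-- bodies with a regex. This has to read the entire multi-TB `body`
-- column, which is exactly the cost a custom metric avoids.
SELECT
  COUNT(DISTINCT page) AS pages_with_async_script
FROM
  `httparchive.response_bodies.2019_03_01_desktop`
WHERE
  REGEXP_CONTAINS(body, r'<script[^>]+\basync')
```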
cc @bkardell