
Document process for new custom metrics #138

Closed
rviscomi opened this issue Apr 9, 2019 · 7 comments

@rviscomi
Member

rviscomi commented Apr 9, 2019

Custom metrics are a powerful way to extract insights from pages directly from the JS runtime context. They're especially useful for analyzing things that depend on the DOM, like counting HTML element and attribute usage. These are things that could be approximated using regular expressions over the response bodies, but that becomes extremely costly considering that the latest desktop and mobile response_bodies tables are over 11 TB combined.

I'd like to use this issue to discuss what the criteria should be for the custom metrics. What are the limitations of doing more work in custom metrics? Should we be actively cleaning up unused custom metrics? Should we have a policy for allowing one-time custom metrics by request?

cc @bkardell

@bkardell

bkardell commented Apr 9, 2019

Thanks so much for opening this @rviscomi - I didn't realize that these were a thing (the fact that they are in 'legacy' made me not pay attention to them) when I asked you about this - but yes, this is very close to what I was asking about, for all of the reasons you mention. I don't want to distract from the primary question you're asking, but it's not immediately clear to me how to access or analyze the data collected by these. It's hard to contribute usefully to the discussion without that, so I apologize if this is noise or distraction - feel free to ignore it here and reply privately if that's better.

For example: if I had to guess from this, it seems like each metric records a single value and it's up to it to do everything, so they do their own full tree scans. Does that become a concern if we do too many, or is it simply about storage size since it is duplicating values in a sense to summarize? Or, is it about what you can do with the data once you have it?

I see that they kind of vary - it looks like they all need to return a single value, but some of them, like the third-party libraries one, seem to return complex, stringified data in that value. This could certainly be used to sum up data into more manageable chunks, as long as that doesn't get too complex, I guess. The things I am interested in (tags/attributes) could be reasonably summarized into a similarly stringified value via something like this function, which summarizes the WHATWG HTML single-page edition (currently about 10.4 MB of HTML markup) into about 6.3 KB of summary of all of the tags and attributes -- but then how would you analyze that? Regexps again? Or somehow expand that into wide columns? If the dataset is small enough that one can reasonably process it programmatically, maybe that's ok?

@rviscomi
Member Author

rviscomi commented Apr 9, 2019

it's not immediately clear to me how to access or analyze data collected by these

Custom metrics are included in the JSON-encoded HAR payloads in the pages dataset. Here's an example of querying the num_scripts_async custom metric (note that JSON_EXTRACT_SCALAR returns a STRING, so the value needs to be cast to INT64 before computing quantiles):

SELECT
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, '$._num_scripts_async') AS INT64), 1000)[OFFSET(500)] AS median_async_scripts
FROM
  `httparchive.pages.2019_03_01_desktop`

It's a bit more complicated to unpack JSON-encoded custom metrics like third-parties, but still doable.

For example: if I had to guess from this, it seems like each metric records a single value and it's up to it to do everything, so they do their own full tree scans. Does that become a concern if we do too many, or is it simply about storage size since it is duplicating values in a sense to summarize? Or, is it about what you can do with the data once you have it?

Yes, each custom metric returns a single scalar value like int, boolean, or string. Complex data types like arrays and objects must be JSON-encoded. (@pmeenan sanity check: are complex types compatible with the WPT reporting?)
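For illustration, a custom metric snippet returning complex data might look something like this. This is a hypothetical sketch, not one of the actual metrics; WPT evaluates the snippet in the page context and records its return value, so it's wrapped in a named function here only so it can run standalone:

```javascript
// Hypothetical custom-metric sketch: count <script> tags and JSON-encode
// the result, since complex types must be returned as a single string.
// In WPT, the function body alone would be the snippet, with `doc`
// being the page's `document`.
function numScriptsMetric(doc) {
  return JSON.stringify({
    scripts: doc.querySelectorAll('script').length,
    asyncScripts: doc.querySelectorAll('script[async]').length
  });
}
```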

I do think there's a concern about having too many custom metrics, or metrics that are computationally expensive. Given that these metrics run on every desktop and mobile test, millions of times per month, every second of testing time counts.

The things I am interested in (tags/attributes) could be reasonably summarized into a similarly stringified value via something like this function, which summarizes the WHATWG HTML single-page edition (currently about 10.4 MB of HTML markup) into about 6.3 KB of summary of all of the tags and attributes -- but then how would you analyze that? Regexps again? Or somehow expand that into wide columns? If the dataset is small enough that one can reasonably process it programmatically, maybe that's ok?

6 KB of JSON isn't terrible for a single page, but I'm curious to see what the aggregate effect is over all ~4M tests across desktop and mobile. Assuming that's the average case, we should expect each monthly pages table to increase by ~24 GB (~2% of the entire free BigQuery monthly quota). It's not negligible.

Analyzing that JSON data would be done with the JSON_EXTRACT BigQuery function (see the third-parties analysis for example). User-defined functions (UDFs) in BigQuery can do additional heavy lifting like full JSON parsing and array iteration (limitations of JSON_EXTRACT) if needed, but hopefully we design the JSON object so those aren't necessary (i.e., by avoiding arrays).

Markup analysis is a good case study for the purposes of this issue. We should figure out:

  • how much JSON data is too much
  • what the JSON format should be
  • whether we can afford tags and attributes or just tags
  • whether we include this custom metric for a single crawl or keep it indefinitely

@bkardell

bkardell commented Apr 9, 2019

how much JSON data is too much
what the JSON format should be

JSON is itself not especially efficient here. It's not terrible, but it's certainly plausible to save bytes by doing something else that's generally almost as simple to use (simple key/value pairs should more or less work here) - but I guess that's a non-option really? I wouldn't suggest that the HTML Standard example is representative - there are very small pages that would generate similar-sized or even larger summaries, and vice versa - just that it seems to inherently have a (comparatively) pretty small upper bound and provide a lot of data.

whether we can afford tags and attributes or just tags

It seems that there is already collection happening for attributes, and having that context seems helpful for a number of questions we try to answer... maybe it even makes some of the metrics we already have just a summary of this summary and less immediately useful? But yes - this would get considerably smaller if it were just tag names and their counts, and this alone I think would be a big step up, as it catches anything that parses into a tag, which you can then look at a lot of ways.

whether we include this custom metric for a single crawl or keep it indefinitely

I would very much like to keep it, or at least include it periodically? Having some grip on this data (even over time) seems important for too many discussions.

@rviscomi
Member Author

For the markup metric let's start with JSON for simplicity and we can revisit the format if we need to save on bytes. I agree that kv pairs of tag names and frequency counts are a good start.
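As a rough sketch of what those kv pairs could look like (hypothetical, assuming the metric walks the live DOM and returns a JSON-encoded map of tag names to counts):

```javascript
// Hypothetical sketch of the tag-frequency idea: count every element's
// tag name and JSON-encode the { tag: count } map into a single string,
// since custom metrics must return a single scalar value.
function countTags(doc) {
  const counts = {};
  for (const el of doc.querySelectorAll('*')) {
    const tag = el.tagName.toLowerCase();
    counts[tag] = (counts[tag] || 0) + 1;
  }
  return JSON.stringify(counts);
}
```

On the BigQuery side, an individual tag's count could then be pulled back out with JSON_EXTRACT_SCALAR and a path like '$.div'.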

@rviscomi
Member Author

@igrigorik @pmeenan @paulcalvano I'm looking at the initial list of metrics brainstormed for the Almanac and for many of them I think custom metrics would be useful if not the only practical way. The questions raised in this issue are becoming more important if we're going to consider adding ~dozens of new custom metrics. The high level question to answer is, how much can we realistically do using custom metrics without affecting the crawl?

One use case that comes to mind is the word count per page listed in the SEO chapter. There are many other metrics with complicated implementations in the rest of the chapters.
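For instance, the word-count metric could be nearly a one-liner over the page's visible text (a hypothetical sketch; what counts as a "word" for the SEO chapter would still need agreement):

```javascript
// Hypothetical word-count sketch: split visible text on whitespace and
// count the non-empty tokens. In an actual custom metric, `text` would
// be something like document.body.innerText.
function wordCount(text) {
  return text.split(/\s+/).filter(Boolean).length;
}
```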

(@HTTPArchive/data-analysts FYI)

@pmeenan
Member

pmeenan commented May 25, 2019 via email

@rviscomi
Member Author

rviscomi commented Oct 2, 2020

@rviscomi rviscomi closed this as completed Oct 2, 2020