
Document process for new custom metrics #138

Closed
rviscomi opened this issue Apr 9, 2019 · 7 comments

@rviscomi
Member

rviscomi commented Apr 9, 2019

Custom metrics are a powerful way to extract insights from pages directly from the JS runtime context. They're especially useful for analyzing things that depend on the DOM, like counting HTML element and attribute usage. These are things that could be approximated using regular expressions over the response bodies, but that becomes extremely costly considering that the latest desktop and mobile response_bodies tables are over 11 TB combined.

I'd like to use this issue to discuss what the criteria should be for the custom metrics. What are the limitations of doing more work in custom metrics? Should we be actively cleaning up unused custom metrics? Should we have a policy for allowing one-time custom metrics by request?

cc @bkardell

@bkardell

bkardell commented Apr 9, 2019

Thanks so much for opening this @rviscomi - I didn't realize that these were a thing (the fact that they are in 'legacy' made me not pay attention to them) when I asked you about this - but yes, this is very close to what I was asking about, for all of the reasons you mention. I don't want to distract from the primary question you're asking, but it's not immediately clear to me how to access or analyze the data collected by these. It's hard to contribute usefully to the discussion without that, so I apologize if this is noise or distraction - feel free to ignore it here and reply privately if that's better.

For example: if I had to guess from this, it seems like each metric records a single value and it's up to it to do everything, so they do their own full tree scans. Does that become a concern if we do too many, or is it simply about storage size since it is duplicating values in a sense to summarize? Or, is it about what you can do with the data once you have it?

I see that they kind of vary - it looks like they all need to return a single value, but some of them, like the third-party libraries one, seem to return complex, stringified data in that value. This could certainly be used to sum up data into more manageable chunks, as long as that doesn't get too complex, I guess. The things I am interested in (tags/attributes) could be reasonably summarized into a similarly stringified value via something like this function, which summarizes the WHATWG HTML single-page edition (currently about 10.4 MB of HTML markup) into about 6.3 KB of summary of all of the tags and attributes -- but then how would you analyze that? Regexps again? Or somehow expand that into wide columns? If the dataset is small enough that one can reasonably process it programmatically, maybe that's ok?

@rviscomi
Member Author

rviscomi commented Apr 9, 2019

it's not immediately clear to me how to access or analyze data collected by these

Custom metrics are included in the JSON-encoded HAR payloads in the pages dataset. Here's an example of querying the num_scripts_async custom metric (note that JSON_EXTRACT_SCALAR returns a STRING, so the value needs to be cast to INT64 before computing quantiles):

SELECT
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, '$._num_scripts_async') AS INT64), 1000)[OFFSET(500)] AS median_async_scripts
FROM
  `httparchive.pages.2019_03_01_desktop`

It's a bit more complicated to unpack JSON-encoded custom metrics like third-parties, but still doable.

For example: if I had to guess from this, it seems like each metric records a single value and it's up to it to do everything, so they do their own full tree scans. Does that become a concern if we do too many, or is it simply about storage size since it is duplicating values in a sense to summarize? Or, is it about what you can do with the data once you have it?

Yes, each custom metric returns a single scalar value like int, boolean, or string. Complex data types like arrays and objects must be JSON-encoded. (@pmeenan sanity check: are complex types compatible with the WPT reporting?)
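For illustration, a custom metric snippet returning complex data might look something like this. This is a hypothetical sketch, not one of the actual metrics; WPT evaluates the snippet in the page context and records its return value, so it's wrapped in a named function here only so it can run standalone:

```javascript
// Hypothetical custom-metric sketch: count <script> tags and JSON-encode
// the result, since complex types must be returned as a single string.
// In WPT, the function body alone would be the snippet, with `doc`
// being the page's `document`.
function numScriptsMetric(doc) {
  return JSON.stringify({
    scripts: doc.querySelectorAll('script').length,
    asyncScripts: doc.querySelectorAll('script[async]').length
  });
}
```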

I do think there's a concern about having too many custom metrics, or metrics that are computationally expensive. Given that these metrics run on every desktop and mobile test, millions of times per month, every second of testing time counts.

The things I am interested in (tags/attributes) could be reasonably summarized into a similarly stringified value via something like this function, which summarizes the WHATWG HTML single-page edition (currently about 10.4 MB of HTML markup) into about 6.3 KB of summary of all of the tags and attributes -- but then how would you analyze that? Regexps again? Or somehow expand that into wide columns? If the dataset is small enough that one can reasonably process it programmatically, maybe that's ok?

6 KB of JSON isn't terrible for a single page, but I'm curious to see what the aggregate effect is over all ~4M tests across desktop and mobile. Assuming that's the average case, we should expect each monthly pages table to increase by ~24 GB (~2% of the entire free BigQuery monthly quota). It's not negligible.

Analyzing that JSON data would be done with the JSON_EXTRACT BigQuery function (see the third-parties analysis for example). User-defined functions (UDFs) in BigQuery can do additional heavy lifting like full JSON parsing and array iteration (limitations of JSON_EXTRACT) if needed, but hopefully we design the JSON object so those aren't necessary (i.e., by avoiding arrays).

Markup analysis is a good case study for the purposes of this issue. We should figure out:

  • how much JSON data is too much
  • what the JSON format should be
  • whether we can afford tags and attributes or just tags
  • whether we include this custom metric for a single crawl or keep it indefinitely

@bkardell

bkardell commented Apr 9, 2019

how much JSON data is too much
what the JSON format should be

JSON is itself not especially efficient here. It's not terrible, but it's certainly plausible to save bytes by doing something else that's generally almost as simple to use (simple key/value pairs should more or less work here) - but I guess that's a non-option really? I wouldn't suggest that the HTML Standard example is representative - there are very small pages that would generate similar-sized or even larger summaries, and vice versa - just that it seems to inherently have a (comparatively) pretty small upper bound and provide a lot of data.

whether we can afford tags and attributes or just tags

It seems that there is already collection happening for attributes, and having that context seems helpful for a number of questions we try to answer... maybe it even makes some of the metrics we already have just a summary of this summary and less immediately useful? But yes - this would get considerably smaller if it were just tag names and their counts, and this alone I think would be a big step up, as it catches anything that parses into a tag, which you can then look at a lot of ways.

whether we include this custom metric for a single crawl or keep it indefinitely

I would very much like to keep it, or at least include it periodically? Having some grip on this data (even over time) seems important for too many discussions.

@rviscomi
Member Author

For the markup metric let's start with JSON for simplicity and we can revisit the format if we need to save on bytes. I agree that kv pairs of tag names and frequency counts are a good start.
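As a rough sketch of what those kv pairs could look like (hypothetical, assuming the metric walks the live DOM and returns a JSON-encoded map of tag names to counts):

```javascript
// Hypothetical sketch of the tag-frequency idea: count every element's
// tag name and JSON-encode the { tag: count } map into a single string,
// since custom metrics must return a single scalar value.
function countTags(doc) {
  const counts = {};
  for (const el of doc.querySelectorAll('*')) {
    const tag = el.tagName.toLowerCase();
    counts[tag] = (counts[tag] || 0) + 1;
  }
  return JSON.stringify(counts);
}
```

On the BigQuery side, an individual tag's count could then be pulled back out with JSON_EXTRACT_SCALAR and a path like '$.div'.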

@rviscomi
Member Author

@igrigorik @pmeenan @paulcalvano I'm looking at the initial list of metrics brainstormed for the Almanac and for many of them I think custom metrics would be useful if not the only practical way. The questions raised in this issue are becoming more important if we're going to consider adding ~dozens of new custom metrics. The high level question to answer is, how much can we realistically do using custom metrics without affecting the crawl?

One use case that comes to mind is the word count per page listed in the SEO chapter. There are many other metrics with complicated implementations in the rest of the chapters.
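For instance, the word-count metric could be nearly a one-liner over the page's visible text (a hypothetical sketch; what counts as a "word" for the SEO chapter would still need agreement):

```javascript
// Hypothetical word-count sketch: split visible text on whitespace and
// count the non-empty tokens. In an actual custom metric, `text` would
// be something like document.body.innerText.
function wordCount(text) {
  return text.split(/\s+/).filter(Boolean).length;
}
```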

(@HTTPArchive/data-analysts FYI)

@pmeenan
Member

pmeenan commented May 25, 2019 via email

@rviscomi
Member Author

rviscomi commented Oct 2, 2020

@rviscomi rviscomi closed this as completed Oct 2, 2020