Stop making validation tests for non HTML content #2623
Conversation
The web crawler would generate HTML validation tests for all reachable pages, and mark them as *skipped* if the page is blacklisted or the retrieved page has a content-type that doesn't appear to be HTML-related.

Skipped tests should be an indicator of some unusual situation or missing condition that prevents the test from running. However, non-HTML pages can never be validated as such, and the *normal* situation is that these tests are useless. Since a lot of the crawled content isn't actually HTML, a large number of SKIP results is produced in test reports. These results only serve to hide the results of tests that were skipped for *genuinely* unexpected or extraordinary reasons.

This change filters crawled pages based on blacklist status and content-type before the result is used to parametrize the `test_page_should_be_valid_html` test.
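For illustration, here is a minimal sketch of the approach, reusing the helper names that come up in the review below (`crawl`, `should_validate`, `crawl_only_html`). The `Page` stub, the example URLs, and all signatures are assumptions made for the sketch, not NAV's actual code:

```python
import pytest
from dataclasses import dataclass

# Module-level blacklist; currently empty (see the maintainer's comment
# further down in this thread).
BLACKLISTED_PAGES = set()
HTML_CONTENT_TYPES = ("text/html", "application/xhtml+xml")


@dataclass(frozen=True)
class Page:
    url: str
    content_type: str


def crawl():
    """Stand-in for the real crawler, which yields every reachable page."""
    yield Page("http://example.org/", "text/html")
    yield Page("http://example.org/report.pdf", "application/pdf")


def should_validate(page):
    """Only non-blacklisted pages with an HTML-like content-type qualify."""
    if page.url in BLACKLISTED_PAGES:
        return False
    return page.content_type.startswith(HTML_CONTENT_TYPES)


def crawl_only_html():
    """Filter before parametrization, so non-HTML pages never become tests."""
    return [page for page in crawl() if should_validate(page)]


# The test is parametrized over the filtered list instead of skipping
# non-HTML pages at run time.
@pytest.mark.parametrize("page", crawl_only_html())
def test_page_should_be_valid_html(page):
    ...  # feed the page through an HTML validator here
```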
Codecov Report

|          | master | #2623  | +/-    |
|----------|--------|--------|--------|
| Coverage | 54.20% | 54.52% | +0.31% |
| Files    | 558    | 558    |        |
| Lines    | 40634  | 40644  | +10    |
| Hits     | 22026  | 22160  | +134   |
| Misses   | 18608  | 18484  | -124   |

See 14 files with indirect coverage changes.
This could be removed since we call `should_validate` in `crawl_only_html`, or am I not understanding this correctly?
Indeed, you are correct. I was probably thinking "better safe than sorry", but this condition (and the one above it) should never be triggered any more.
1. As mentioned in review comments, `test_page_should_be_valid_html` no longer needs to test whether a page should be validated, since its input is now guaranteed to be filtered.
2. `should_validate()` now performs both filtering checks: a blacklisted page should not be validated, and a non-HTML page should not be validated.
3. With the above changes, `crawl_only_html()` can now be refactored to a one-liner (see the sketch below).
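Continuing the hypothetical sketch above, points 1 and 3 amount to something like the following before/after; the actual diff may of course differ:

```python
# Before: the guard lived inside the test body and produced SKIP results.
@pytest.mark.parametrize("page", crawl())
def test_page_should_be_valid_html(page):
    if not should_validate(page):
        pytest.skip("page is blacklisted or not HTML")
    ...

# After: the guard is gone from the test, and crawl_only_html()
# collapses to a one-liner.
def crawl_only_html():
    return [page for page in crawl() if should_validate(page)]
```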
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
This all looks sensible except for the names. Approved as is, but as later polish, rename `crawl`, `crawl_only_html` and `should_validate`. Naming, it is hard, yes.

`should_validate` checks if something is crawlable, so `is_crawlable` is a better name.

`crawl_only_html` makes the actual list of urls to crawl, so maybe `urls_to_crawl`, `crawl` or `filtered_crawl`.

The actual crawling is now done by `crawl`. It is only used by `crawl_only_html`, so it should at least start with an underscore.
You might get this sense of things if you only read the diff and not the whole code, @hmpf.
I read more than the diff, but not everything, no.
It looks like it is crawled twice this way. Unless pytest runs the fixture before any tests use it?
So rename it.
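As an aside on the "crawled twice" question: arguments to `pytest.mark.parametrize` are evaluated at collection time, once per decorator that uses them, so if several tests parametrize over `crawl_only_html()` the site would indeed be crawled repeatedly. One generic safeguard (not necessarily what NAV does; `_crawled_pages` is a hypothetical helper) is to memoize the crawl itself:

```python
import functools


@functools.lru_cache(maxsize=None)
def _crawled_pages():
    """Crawl once; every later call returns the same cached tuple."""
    return tuple(crawl())


def crawl_only_html():
    return [page for page in _crawled_pages() if should_validate(page)]
```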
Agreed. I have another branch where I am re-working the test suite to properly use fixtures for external dependencies, so that we don't need to run the entire test suite from within a very specific Docker environment. I have reworked the crawler tests somewhat there, and any such naming cleanups would fit nicely there.
The blacklist is actually a global constant of the module. It's currently empty, but IIRC it's there because it once used to include pages from 3rd party tools that were mounted on the NAV site (i.e. we had no control over their HTML, but they still needed to be reachable in a complete site).