Stop making validation tests for non HTML content #2623
Conversation
The web crawler would generate HTML validation tests for all reachable pages, and mark them as *skipped* if the page is blacklisted or the retrieved page has a content-type that doesn't appear to be HTML-related.

Skipped tests should be an indicator of some unusual situation or missing condition that prevents the test from running. However, non-HTML pages can never be validated as such, and the *normal* situation is that these tests are useless. Since a lot of the crawled content isn't actually HTML, a large number of SKIP results is produced in test reports. These results only serve to hide the results of tests that were skipped for *genuinely* unexpected or extraordinary reasons.

This change filters crawled pages based on blacklist status and content-type before the result is used to parametrize the `test_page_should_be_valid_html` test.
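For illustration, here is a minimal sketch of the approach, reusing the helper names that come up in the review below (`crawl`, `should_validate`, `crawl_only_html`). The `Page` stub, the example URLs, and all signatures are assumptions made for the sketch, not NAV's actual code:

```python
import pytest
from dataclasses import dataclass

# Module-level blacklist; currently empty (see the maintainer's comment
# further down in this thread).
BLACKLISTED_PAGES = set()
HTML_CONTENT_TYPES = ("text/html", "application/xhtml+xml")


@dataclass(frozen=True)
class Page:
    url: str
    content_type: str


def crawl():
    """Stand-in for the real crawler, which yields every reachable page."""
    yield Page("http://example.org/", "text/html")
    yield Page("http://example.org/report.pdf", "application/pdf")


def should_validate(page):
    """Only non-blacklisted pages with an HTML-like content-type qualify."""
    if page.url in BLACKLISTED_PAGES:
        return False
    return page.content_type.startswith(HTML_CONTENT_TYPES)


def crawl_only_html():
    """Filter before parametrization, so non-HTML pages never become tests."""
    return [page for page in crawl() if should_validate(page)]


# The test is parametrized over the filtered list instead of skipping
# non-HTML pages at run time.
@pytest.mark.parametrize("page", crawl_only_html())
def test_page_should_be_valid_html(page):
    ...  # feed the page through an HTML validator here
```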
Codecov Report

|          | master | #2623  | +/-    |
|----------|--------|--------|--------|
| Coverage | 54.20% | 54.52% | +0.31% |
| Files    | 558    | 558    |        |
| Lines    | 40634  | 40644  | +10    |
| Hits     | 22026  | 22160  | +134   |
| Misses   | 18608  | 18484  | -124   |

See 14 files with indirect coverage changes.
This could be removed since we call `should_validate` in `crawl_only_html`, or am I not understanding this correctly?
Indeed, you are correct. I was probably thinking "better safe than sorry", but this condition (and the one above it) should never be triggered any more.
1. As mentioned in review comments, `test_page_should_be_valid_html` no longer needs to test whether a page should be validated, since its input is now guaranteed to be filtered.
2. `should_validate()` now performs both filtering checks: a blacklisted page should not be validated, and a non-HTML page should not be validated.
3. With the above changes, `crawl_only_html()` can now be refactored to a one-liner (see the sketch below).
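Continuing the hypothetical sketch above, points 1 and 3 amount to something like the following before/after; the actual diff may of course differ:

```python
# Before: the guard lived inside the test body and produced SKIP results.
@pytest.mark.parametrize("page", crawl())
def test_page_should_be_valid_html(page):
    if not should_validate(page):
        pytest.skip("page is blacklisted or not HTML")
    ...

# After: the guard is gone from the test, and crawl_only_html()
# collapses to a one-liner.
def crawl_only_html():
    return [page for page in crawl() if should_validate(page)]
```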
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
This all looks sensible except for the names. Approved as is, but as later polish, rename `crawl`, `crawl_only_html` and `should_validate`. Naming, it is hard, yes.

`should_validate` checks if something is crawlable, so `is_crawlable` is a better name.

`crawl_only_html` makes the actual list of urls to crawl, so maybe `urls_to_crawl`, `crawl` or `filtered_crawl`.

The actual crawling is now done by `crawl`. It is only used by `crawl_only_html`, so it should at least start with an underscore.
You might get this sense of things if you only read the diff and not the whole code, @hmpf.
I read more than the diff, but not everything, no.
It looks like it is crawled twice this way. Unless pytest runs the fixture before any tests use it?
So rename it.
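As an aside on the "crawled twice" question: arguments to `pytest.mark.parametrize` are evaluated at collection time, once per decorator that uses them, so if several tests parametrize over `crawl_only_html()` the site would indeed be crawled repeatedly. One generic safeguard (not necessarily what NAV does; `_crawled_pages` is a hypothetical helper) is to memoize the crawl itself:

```python
import functools


@functools.lru_cache(maxsize=None)
def _crawled_pages():
    """Crawl once; every later call returns the same cached tuple."""
    return tuple(crawl())


def crawl_only_html():
    return [page for page in _crawled_pages() if should_validate(page)]
```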
Agreed. I have another branch where I am re-working the test suite to properly use fixtures for external dependencies, so that we don't need to run the entire test suite from within a very specific Docker environment. I have reworked the crawler tests somewhat there, and any such naming cleanups would fit nicely there.
The blacklist is actually a global constant of the module. It's currently empty, but IIRC it's there because it once used to include pages from 3rd party tools that were mounted on the NAV site (i.e. we had no control over their HTML, but they still needed to be reachable in a complete site).