Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Un-parametrize" webcrawler tests #2966

Merged
merged 5 commits into from
Sep 24, 2024
Merged

Conversation

lunkwill42
Copy link
Member

@lunkwill42 lunkwill42 commented Sep 12, 2024

This is a necessary step to move the test web server setup from tests/integration/conftest.py to a test suite fixture in the future:

Before this PR, a reachability test and a HTML validation test was dynamically generated for each reachable page, by crawling a NAV web server at test discovery time, and using this to parametrize the reachability and validation tests (which sort of artificially inflated the number of tests in the test suite).

However, it is much more useful to make a resource fixture for the web server, so tests that need it can declare their dependency on it. Even if it were possible to use a fixture as input to test parametrization, it would mean that the test discovery phase was still dependent on a web server already running.

This leaves us with the suggestion of this PR: Make the web crawler itself a fixture that generates a list of pages, and have a single validation test and a single reachability test that loops over all the generated results of this fixture.

This vastly reduces the total number of tests in the test suite - and the number will not "randomly" increase every time some new web pages become crawl-able. It also considerably reduces the time spent in the test discovery phase, as the crawler will only run if the current test session needs it.

Later, we can make the crawler fixture depend on a web server fixture, thereby also ensuring a web server is only started if selecting tests that need it (instead of the current solution: which always starts a web server if any integration test is selected).

This was extracted and made independent from #2675

Copy link

github-actions bot commented Sep 12, 2024

Test results

    9 files      9 suites   7m 55s ⏱️
2 109 tests 2 109 ✅ 0 💤 0 ❌
3 957 runs  3 957 ✅ 0 💤 0 ❌

Results for commit 809a18a.

♻️ This comment has been updated with latest results.

Copy link

codecov bot commented Sep 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.83%. Comparing base (8ae4170) to head (809a18a).
Report is 531 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2966      +/-   ##
==========================================
+ Coverage   56.60%   56.83%   +0.22%     
==========================================
  Files         602      602              
  Lines       43713    43715       +2     
  Branches       48       48              
==========================================
+ Hits        24744    24845     +101     
+ Misses      18957    18858      -99     
  Partials       12       12              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Base automatically changed from test/move-snmpsim-requirement to master September 12, 2024 16:42
While pytest can accomplish a lot of exciting things, it cannot use
fixtures as input to test parametrization.  While we can make a test
depend on a fixture for getting a running webserver instance, the test
discovery phase that generates the tests cannot.

I.e. we cannot get our test parameters from the webcrawler unless the
web server was already up and running when the test discovery phase
started.  Sad, but true.

This changes the webcrawler into a fixture, and changes the web link
reachability and HTML validation tests to iterate the pages provided
by this session-scoped crawler.

This also considerably shortens the discovery phase, since the crawling
actually takes place during the running of the first test that uses
the fixture.
@lunkwill42 lunkwill42 force-pushed the test/webcrawler-as-single-test branch from 0d24838 to e820b88 Compare September 20, 2024 12:44
This is mostly relevant for devs.
Copy link

github-actions bot commented Sep 20, 2024

🦙 MegaLinter status: ✅ SUCCESS

Descriptor Linter Files Fixed Errors Elapsed time
✅ PYTHON black 987 0 23.93s
✅ PYTHON flake8 987 0 434.84s

See detailed report in MegaLinter reports

MegaLinter is graciously provided by OX Security

@lunkwill42 lunkwill42 self-assigned this Sep 20, 2024
@lunkwill42 lunkwill42 requested a review from a team September 20, 2024 12:56
@lunkwill42 lunkwill42 marked this pull request as ready for review September 20, 2024 12:56
Copy link
Contributor

@johannaengland johannaengland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments, I think this is a good idea generally, but IMO we should not stop at the first page that is not reachable/invalid HTML, but rather collect all the errors and show them at the end

if page.response != 200:
# No need to fill up the test report files with contents of OK pages
print(_content_as_string(page.content))
assert page.response == 200, "{} is not reachable".format(page.url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only complaint here is that it stops as soon as the first page is not reachable. Before you could see all pages that aren't reachable, which might help when trouble shooting, to easier figure out what is broken

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


assert not errors, "Found following validation errors:\n" + errors
assert not errors, "{} did not validate as HTML".format(page.url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@lunkwill42
Copy link
Member Author

See comments, I think this is a good idea generally, but IMO we should not stop at the first page that is not reachable/invalid HTML, but rather collect all the errors and show them at the end

Yeah. I would tend to agree :)

The page reachability and validation tests would both exit on the first
page that fails the check.  This is different to when each page
generated a dynamic test.  It was rightly-so pointed out in code
review that this may mask multiple failures, and that all failures
should be output.
For some HTTPErrors, it seems the original crawler code overloads the
meaning of the Page.content_type attribute, putting an exception
in there instead.  Not entirely sure what the reasoning could be, so
this just adds handling of that potential case to the `should_validate`
function.
@lunkwill42
Copy link
Member Author

It turns out that it is exceedingly difficult to provoke an HTML validation failure to test that this still works. Our default config suppresses tidy warnings in favor of only fetching tidy errors. However, it seems most problems tidy finds in HTML is classed as a warning, and not an error (HTML5 is very forgiving).

We may want to reconsider the warning-suppression after a team discussion.

If there are multiple unreachable or non-validated pages, pytest will
eventually truncate the list it prints.  This explicitly adds an
assertion message that should include the full list of failures,
separated by newlines.
@lunkwill42 lunkwill42 force-pushed the test/webcrawler-as-single-test branch from c16b1f7 to 809a18a Compare September 24, 2024 10:31
@lunkwill42 lunkwill42 merged commit fb9353f into master Sep 24, 2024
13 checks passed
@lunkwill42 lunkwill42 deleted the test/webcrawler-as-single-test branch September 24, 2024 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants