Run scrapers against live data sources #1059

Merged: tillprochaska merged 8 commits into main from 1057-scraper-tests-live-sources on Dec 15, 2024

Conversation

tillprochaska
Collaborator

@tillprochaska tillprochaska commented Nov 17, 2024

Fixes #1057.

The new HTV_TEST_MOCK_REQUESTS environment variable can be set to false to disable HTTP request mocking. With mocking disabled, all HTTP requests initiated during tests are sent to the live data source instead of being answered from fixtures.

Not all tests can be run easily against live data sources. For example, some tests cover edge cases that cannot be reproduced reliably using live data. The test marker always_mock_requests can be used to override the global setting for individual tests.

For example, the following test will never send a real HTTP request, even when HTV_TEST_MOCK_REQUESTS=false.

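import pytest
import requests


@pytest.mark.always_mock_requests
def test_lorem_ipsum(responses):
    # Register a mocked response; no real request is sent for this URL.
    responses.get("https://example.org", body="Lorem ipsum")

    # requests.Response exposes the decoded body as `.text` (there is no `.body`).
    assert requests.get("https://example.org").text == "Lorem ipsum"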

I have chosen to explicitly mark tests that should never send real HTTP requests (rather than the opposite) to encourage writing tests that can be executed against the live data sources. Writing a (scraper) test that needs to mock HTTP requests should be the exception.
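The interplay between the global setting and the per-test marker can be sketched as a small decision function. This is only an illustration and not the code in conftest.py: the helper names `mock_requests_enabled` and `should_mock`, the default of mocking when the variable is unset, and the set of values treated as false are all assumptions. Only `HTV_TEST_MOCK_REQUESTS` and `always_mock_requests` come from this PR.

```python
import os

# Assumption: these spellings count as "false"; the PR only mentions `false`.
FALSY_VALUES = {"0", "false", "no", "off"}


def mock_requests_enabled(environ=None):
    """Return True unless HTV_TEST_MOCK_REQUESTS is set to a false-like value."""
    environ = os.environ if environ is None else environ
    # Assumption: mocking stays enabled when the variable is not set.
    raw = environ.get("HTV_TEST_MOCK_REQUESTS", "true")
    return raw.strip().lower() not in FALSY_VALUES


def should_mock(has_always_mock_marker, environ=None):
    """The marker wins over the global setting; otherwise follow the setting."""
    return has_always_mock_marker or mock_requests_enabled(environ)
```

In a real fixture, `has_always_mock_marker` would presumably come from something like `request.node.get_closest_marker("always_mock_requests") is not None`, so an unmarked test only hits the live source when mocking has been switched off globally.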

For now, I’ve marked the tests covering the RCVListScraper as well as a few tests covering generic functionality (such as timeout handling). The RCVListScraper tests cannot easily be executed against the live source, and because they use an XML source they are probably less likely to break than scrapers that use an HTML source.

I have also updated a few scrapers and fixtures because the live source data has changed, although the scraper was actually broken in only one case (geographic areas).

With regard to the implementation (see the responses fixture in conftest.py), I’m a little unsure whether its hackiness indicates that this is something you shouldn’t do, or whether it’s simply a use case that isn’t very common.

@tillprochaska tillprochaska force-pushed the 1057-scraper-tests-live-sources branch from b3f1927 to 73e0ffe November 17, 2024 13:05
This isn’t a breaking change in practice as the old URLs still work and redirect to the new URLs.
This test was using an outdated fixture. As the 9th plenary term is over, there are no ongoing group memberships anymore and all group memberships now have an end date. I've updated the test to use Markus Weber's group memberships from the current term. We will have to update the test when something changes (or by the 11th term at the latest), but that should be manageable.
The test fixture used in this case was different from the live source. I've replaced it with a copy of the original source.
@tillprochaska tillprochaska force-pushed the 1057-scraper-tests-live-sources branch from 8b64299 to a47a129 November 17, 2024 17:31
@tillprochaska tillprochaska marked this pull request as ready for review November 17, 2024 17:40
@tillprochaska tillprochaska changed the title from "Optionally disable all HTTP request mocks" to "Run scrapers against live data sources" Nov 17, 2024
@tillprochaska tillprochaska merged commit 1ef6c24 into main Dec 15, 2024
2 checks passed
@tillprochaska tillprochaska deleted the 1057-scraper-tests-live-sources branch December 15, 2024 16:22
Development

Successfully merging this pull request may close these issues.

Run scrapers against live source websites in regular interval (nightly/weekly)