Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch results of evening voting sessions #957

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

tillprochaska
Copy link
Collaborator

@tillprochaska tillprochaska commented Jun 8, 2024

While on most plenary days, there is only one voting session around midday, on some days there is another sesssion in the evening, usually around 17:00. The vote results of the evening sessions are appended to the same source document that also contains the results of the midday votes.

Currently, we only run the RCVListPipeline between 12:00 and 15:00 until we’ve been able to fetch vote results successfully. Once we’ve fetched the results, we do not attempt to fetch them again. That means that so far we did not (automatically) fetch results of the evening voting session.

The easy approach to this problem would be to simply keep running the RCVListPipeline until the evening, even after it has successfully completed for the first time. However, this approach has a few disadvantages:

  • We store timestamps of when we accessed the data sources related for a specific vote. Following the easy approach means we’d store the timestamp when the data was last accessed. For diagnostics and analysis, it’s nice to keep the timestamp of the first access.
  • Every pipeline triggers multiple requests for each vote. As it’s not uncommon to see hundreds of votes per day, this would be quite wasteful and not appropriate for a well-behaved scraper.

Instead, we do the following:

  • Between 12:00 and 15:00, we periodically check if the vote results are available until we’ve been able to fetch the vote results successfully (just as we did before).
  • Between 17:00 and 20:00, we periodically check if an updated RCV list has been published. When an updated list has been published, we rerun the entire pipeline and stop the periodic checks afterwards. When the list remains unchanged, we don’t rerun subsequent steps of the pipeline.

This is implemented as follows:

  • When a pipeline has run successfully, it can return a checksum. The checksum is stored and is provided to subsequent runs of the pipeline.
  • The checksum can be any value, but in case of the RCVListPipeline, it is a simple hash of the contents of the RCV list. (In theory, we could make use of HTTP caching mechanisms and try to e.g. store the response ETag. In practice, many of the official sources do not include an ETag in responses, and this would add unnecessary complexity for irrelevant savings.)
  • If during a subsequent run, the hash of RCV list contents hasn’t changed, the pipeline exists early to skip unnecessary requests. Once the pipeline has been run successfully for a second time, we also stop periodically checking if the RCV list contents have changed.

Fixes #917

@tillprochaska tillprochaska marked this pull request as draft June 8, 2024 15:41
@tillprochaska tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch 2 times, most recently from fac4ac1 to 6a43010 Compare June 8, 2024 16:08
@tillprochaska tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch 6 times, most recently from 4236764 to efa13e3 Compare December 8, 2024 15:19
@tillprochaska tillprochaska marked this pull request as ready for review December 8, 2024 15:37
@tillprochaska tillprochaska requested a review from linusha December 8, 2024 15:37
Comment on lines 21 to +30
class PipelineError(Exception):
pass


class DataUnavailableError(PipelineError):
pass


class DataUnchangedError(PipelineError):
pass
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: These aren’t actually errors, they are exceptions used for flow control. I think I’ve blindly followed the linter recommendation, but this naming doesn’t really make sense for this case.

According to PEP 8, exception classes should have an Error suffix, but only if they actually are errors, otherwise they should have no special suffix, so in this case DataUnchanged and DataUnavailable would be more appropriate.

https://peps.python.org/pep-0008/#exception-names

It probably makes sense to fix this separately though, there might be other similar cases elsewhere.

While on most plenary days, there is only one voting session around midday, on some days there is another sesssion in the evening, usually around 17:00. The vote results of the evening sessions are appended to the same source document that also contains the results of the midday votes.

Currently, we only run the `RCVListPipeline` between 12:00 and 15:00 until we’ve been able to fetch vote results successfully. Once we’ve fetched the results, we do not attempt to fetch them again. That means that so far we did not (automatically) fetch results of the evening voting session.

In addition to the current behavior, this change tries to fetch vote results between 17:00 and 20:00, until we’ve been able to fetch them successfully a second time. This is only the first part of the solution, as we also need to check that we only stop scraping vote results once we’ve been able to fetch updated results (e.g. by storing a hash of the source data for every successful pipeline run).
@tillprochaska tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch 2 times, most recently from f61bcda to dcfc2b8 Compare December 29, 2024 12:25
…xit early if data has not changed

This allows us to repeatedly run the pipeline to check if the data has been updated, while stopping the pipeline as soon as possible if it remains unchanged. In case of this pipeline that means we send one request for the RCV list in XML format, but we do not send a requests to fetch pages from EUR-Lex, OEIL, etc. for each of the votes if the RCV list hasn’t changed.
This is analogous to the `BaseScraper` class for scrapers. In contrast to the scrapers, pipelines don’t share a lot of code (except for some error handling and logging), so there wasn’t really a need to add this before. However, having a common base class is also helpful for typing.
In preparation for changes in the next commit to avoid ambiguous naming
Right now, the worker’s `schedule_pipeline` method also handles all the exceptions that might be raised during pipeline execution, e.g. `DataUnavailableError`. This has worked fine so far, but it’s getting a little difficult to test now that we also want to store the checksum of a pipeline run.

So instead, pipelines now return a `PipelineResult`, which includes information about the status (i.e. whether the run was successful or failed because the source data was unavailable, etc.) and `schedule_pipeline` doesn’t need to handle the different ways a pipeline could fail.
I initially added some code to `BaseScraper` that automatically computes the response checksum and exposes it as a property. This is only ever used by pipelines, so in order to remove some indirection, I’m moving it to a pipelines helper.
All three are only used by other modules from the same package
@tillprochaska tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from dcfc2b8 to c1dc7c4 Compare December 29, 2024 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing votes
1 participant