Automate additional data sources using GHA #284

e-belfer · 2024-02-20T14:10:53Z

Overview

Partially addresses #276

What problem does this address?
Adds every archiver that currently runs out of the box to GHA, and sorts file_changes so that reported changes are easier to read.

What did you change in this PR?

Added the remaining data sources to the run-archiver workflow to prevent stale archives, and added some quality-of-life improvements to the summary.
Ran mshamines locally with the refresh-metadata flag to update keywords that were causing an error. This did not resolve the previous_version error, so I removed this archive from the list and moved it into Fix broken archivers #285.
Manually approved new archives for mshamines and epacamd_eia to fix previous_version errors.
Updated the mshamines partition name from dataset to form to avoid downstream errors in extraction.
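
For reviewers, the effect of sorting file_changes can be pictured roughly as below. This is a minimal sketch, not the code in validate.py: the `FileChange` class, its fields, the choice of file name as the sort key, and the sample file names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChange:
    """Hypothetical record of one file-level difference between archive versions."""

    name: str
    diff_type: str  # illustrative labels such as "CREATE", "UPDATE", "DELETE"
    size_diff: int  # change in size, in bytes


def sort_file_changes(changes):
    """Return the changes ordered by file name so the summary is stable and easy to scan."""
    return sorted(changes, key=lambda change: change.name)


# Made-up input, deliberately out of order.
changes = [
    FileChange("eia860-2022.zip", "UPDATE", 1024),
    FileChange("eia860-2001.zip", "CREATE", 2048),
    FileChange("eia860-2010.zip", "DELETE", -512),
]

sorted_names = [change.name for change in sort_file_changes(changes)]
print(sorted_names)
```

Sorting on a deterministic key means two runs over the same data produce the same summary ordering, which makes run-to-run comparison easier.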

Out of scope:

Any additional required debugging is catalogued in Fix broken archivers #285

Testing

How did you make sure this worked? How can a reviewer verify this?
Run run-archiver in GHA. All datasets should pass.

To-do list

Tasks

Review the PR yourself and call out any questions or issues you have

zaneselvans · 2024-02-20T14:36:08Z

.github/workflows/run-archiver.yml

+          - ferc1
+          - ferc2
+          - ferc6
+          - ferc60
+          - ferc714


Do we still expect the FERC XBRL archives will always appear completely new, even if nothing has changed in their contents?

IIRC they were doing something like autogenerating new IDs for every post in the RSS feed every time the feed was read, rather than using persistent unique IDs for each post. Did we find some way around that?

e-belfer · Feb 20, 2024 (edited)

FERC archivers are certainly not working at present, but I'm tracking this in #285. Maybe this is a known and intended failure and I'm missing something, in which case these archivers shouldn't be candidates for automation.

zaneselvans · Feb 20, 2024

Hmm, #285 looks like real brokenness in the FERC archivers.

If we haven't addressed the unique ID thing, we'd just see that all the FERC XBRL archives get updated every time the archiver is run, but I think we'd still be able to get an idea of how much new data there is from the change in the size of the archives, and saving interim data doesn't seem like a bad idea given how flaky FERC's data curation is!
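
To make the point above concrete, here is a rough sketch of how a change in archive size could hint at new data even when every file looks new because its ID was regenerated. It is not how pudl_archiver works: the `compare_snapshots` function, the in-memory snapshot dictionaries, and the file names are hypothetical, and the content-hash comparison is an extra technique added here for illustration alongside the size difference mentioned in the comment.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Hash file contents so identical data matches regardless of its file name or ID."""
    return hashlib.sha256(data).hexdigest()


def compare_snapshots(old: dict[str, bytes], new: dict[str, bytes]) -> dict[str, int]:
    """Compare two hypothetical {file name: contents} snapshots by content and total size."""
    old_digests = {content_digest(data) for data in old.values()}
    new_digests = {content_digest(data) for data in new.values()}
    old_size = sum(len(data) for data in old.values())
    new_size = sum(len(data) for data in new.values())
    return {
        "added": len(new_digests - old_digests),
        "removed": len(old_digests - new_digests),
        "size_diff": new_size - old_size,
    }


# Made-up snapshots: every file name changed, but only one filing is actually new.
old = {
    "filing-id-aaa.xbrl": b"filing one",
    "filing-id-bbb.xbrl": b"filing two",
}
new = {
    "filing-id-ccc.xbrl": b"filing one",
    "filing-id-ddd.xbrl": b"filing two",
    "filing-id-eee.xbrl": b"filing three",
}

result = compare_snapshots(old, new)
print(result)
```

Under these assumptions a name-based diff would report every file as replaced, while the size difference and the content digests both point at a single new filing.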

e-belfer added 2 commits February 20, 2024 09:06

Sort files in summary, try adding all the datasets

015f8b7

Temporarily turn off CEMS

52abb57

e-belfer added the automation label (Issues relating to automated archiver runs) Feb 20, 2024

e-belfer self-assigned this Feb 20, 2024

e-belfer linked an issue Feb 20, 2024 that may be closed by this pull request

Automate remaining archive runs #276

Closed

11 tasks

Merge branch 'main' into automate-everything

52edcd1

zaneselvans reviewed Feb 20, 2024

Fix eiawater DOI, fix nonexistent dataset, remove archivers that don't work yet

d17913d

e-belfer changed the title ~~Automate everything! Add remaining data sources to GHA~~ Automate additional data sources using GHA Feb 20, 2024

e-belfer added 2 commits February 20, 2024 10:00

Clean up and fix eiawater DOI again

fb33fe4

Remove eiawater due to zenodo malfunction

f8cb652

e-belfer mentioned this pull request Feb 20, 2024

Automate remaining archive runs #276

Closed

11 tasks

Restore CEMS run

344f82a

e-belfer requested a review from zaneselvans February 20, 2024 15:13

zaneselvans requested changes Feb 20, 2024

src/pudl_archiver/archivers/validate.py

.github/workflows/run-archiver.yml

Add back CAMD crosswalk and MSHA mines

24156b5

e-belfer requested a review from zaneselvans February 20, 2024 16:26

zaneselvans approved these changes Feb 20, 2024

e-belfer merged commit ce0f16d into main Feb 20, 2024
15 of 16 checks passed

e-belfer deleted the automate-everything branch February 20, 2024 16:56
