Split monolithic ferc_to_sqlite ops into per-dataset pieces #3098
Conversation
Instead of using a single monolithic op that loops over all forms, we can use an op factory and an `XbrlRuntimeSettings` resource to simplify how configuration is passed in. This way, a single runtime-settings resource exists for all XBRL ops, and one op is generated for each form. This should allow for better parallelism, even though it might interfere with `num_workers` defaulting to the number of CPUs (i.e. we would oversubscribe the available cores twice: once for dagster workers and once for XBRL workers). This, however, should be an easy fix.
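A minimal, illustrative Dagster sketch of that pattern (only the `XbrlRuntimeSettings` name comes from this PR; the op and job names are made up):

```python
from dagster import ConfigurableResource, graph, op


class XbrlRuntimeSettings(ConfigurableResource):
    """Runtime knobs shared by every per-form XBRL op."""

    clobber: bool = False
    workers: int = 1
    batch_size: int = 50


def xbrl_op_factory(form: str):
    """Generate one extraction op per FERC form instead of looping in a single op."""

    @op(name=f"{form}_xbrl")
    def extract(xbrl_settings: XbrlRuntimeSettings) -> None:
        # The real op would run the XBRL extractor for this single form,
        # reading clobber/workers/batch_size from the shared resource.
        print(f"{form}: {xbrl_settings.workers} XBRL workers")

    return extract


@graph
def ferc_xbrl_to_sqlite():
    for form in ("ferc1", "ferc2", "ferc6", "ferc60", "ferc714"):
        xbrl_op_factory(form)()


xbrl_job = ferc_xbrl_to_sqlite.to_job(
    resource_defs={"xbrl_settings": XbrlRuntimeSettings(workers=2)}
)
```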
Refactor the monolithic `dbf2sqlite` and `xbrl2sqlite` methods into smaller, per-dataset ops that are invoked within the graphs. This should let us make better use of dagster parallelism and speed up `ferc_to_sqlite` processing. It seems that the current unit/integration tests only use FERC 1 raw data, so I've modified the fixtures to run only the relevant pieces of processing.
I ran this with the pytest refactor; the run is at https://github.com/catalyst-cooperative/pudl/actions/runs/7064167417.
TL;DR: both the dagster and XBRL worker parameters provide similar speed-ups, with XBRL workers providing slightly better value. Note that we can only run with 2 and 4 of each before we run the larger runner out of memory. We could, therefore, run with 2 XBRL workers in …
Thanks for switching up the settings data structure so we can better split out the assets and run CI faster!
I'm also sort of inclined to rely more on dagster parallelization than xbrl extractor parallelization, because one system is easier to think about than two, but that definitely falls out of scope of this PR.
Finally, I have a few small thoughts about the changes you made to the test config that we should talk about before merging this in.
},
},
)


@pytest.fixture(scope="session")
-def ferc_to_sqlite_xbrl_only(
+def ferc1_xbrl_extract(
Ooh, nice to not have to extract all the other forms in test 🤩
Though this is probably irrelevant once we're running `ferc_to_sqlite` normally as part of ci-integration (see #2825) and not as part of the integration tests in-process. At that point, we will likely deprecate these fixtures and assume we're always relying on `--live-dbs`?
Yes, that sounds correct to me - though we should consider making sure we can't actually mutate anything in `--live-dbs` mode...
Are we going to rework the testing setup in the Makefile to run `ferc_to_sqlite` and `pudl_etl` before running the tests too? That would be nice for the sake of consistency (doing the same thing in CI and locally), but would we need a different way of isolating the test outputs from live outputs? If the tests are always run with `--live-dbs`, then should that option just be removed entirely?
I'd say yes to removing the `--live-dbs` functionality altogether, once we've isolated ferc_to_sqlite/etl from the unit/integration tests.

The obvious question is what the semantics of running the integration tests locally should be - do we expect that the outputs are already in place, or do we want to re-run the data generation (which could be time consuming)? In the latter case, even if we no longer have `--live-dbs`, we could set `PUDL_OUTPUT` in the `Makefile` itself to ensure that outputs generated for the sake of testing are isolated from the local outputs.

Happy to have some more conversation about this, but this seems to be out of scope for this specific PR, so I would like to not block on that.
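For the isolation piece, the same idea could also be expressed as a session fixture instead of a Makefile variable; a purely illustrative sketch, not the actual conftest:

```python
import os

import pytest


@pytest.fixture(scope="session", autouse=True)
def isolated_pudl_output(tmp_path_factory):
    """Point PUDL_OUTPUT at a throwaway directory so tests never touch live outputs."""
    out_dir = tmp_path_factory.mktemp("pudl_output")
    old = os.environ.get("PUDL_OUTPUT")
    os.environ["PUDL_OUTPUT"] = str(out_dir)
    yield out_dir
    # Restore whatever was set before the test session.
    if old is None:
        os.environ.pop("PUDL_OUTPUT", None)
    else:
        os.environ["PUDL_OUTPUT"] = old
```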
test/unit/extract/xbrl_test.py

# Construct xbrl2sqlite op context
context = build_op_context(
# always use tmp path here so that we don't clobber the live DB when --live-dbs is passed
We're no longer mocking `PudlPaths()` here - is this comment still relevant? Seems like we're checking at test runtime for `--live-dbs`, which is fine I guess but also error-prone - we need to remember to bail out of any db-touching test when `live_dbs == True`.
Yeah, the fact that the tests operate in these two distinct modes (read-only vs mutating) is a bit dangerous IMO, and there's always a possibility that we don't do isolation well and things get mangled. I suppose that, ideally, we would rely on most of the db tests being validations that will not modify whatever is in place.
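One way to make that bail-out automatic rather than something we have to remember per-test would be a marker-based guard; a minimal sketch, assuming a boolean `live_dbs` fixture exists and with a made-up marker name:

```python
import pytest


@pytest.fixture(autouse=True)
def refuse_db_mutations(request, live_dbs):
    """Skip any test marked as mutating databases whenever --live-dbs is in effect."""
    # Assumes a boolean `live_dbs` session fixture defined elsewhere in conftest.
    if live_dbs and request.node.get_closest_marker("mutates_db"):
        pytest.skip("refusing to mutate existing databases in --live-dbs mode")
```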
Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms. I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC 1 extraction takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.
This is basically putting yet another layer of doing the same thing in a different way on top of dagster parallelization; I can see the practical reason for it here (breaking out of the bounds of relatively small GitHub runners) but it's a bit of a dirty trick IMO. If this can save some money, then we could go for it. The other option could be to have …
Yeah, I don't love the additional complexity, but it seems like it could be done with no change to the codebase, result in a significantly bigger speedup, and also save some paid CI minutes, so it seemed worth looking at. And I guess these approaches are not necessarily mutually exclusive. @jdangerx Do you think there are any clear opportunities for speeding up the XBRL extraction jobs at a low level? Right now it takes about the same amount of time to extract 20-30 years of DBF data as 2 years of XBRL data, so it seems like in production this issue is going to get worse over time.
We can always profile and see if there's anything funky going on - not sure how to prioritize that work over e.g. getting archiver stuff working, again (again).
I think this is a reasonable idea and I would like to pursue it as well. I'm thinking I could add … For resource usage safety, I would also suggest setting the default XBRL workers to 2 and relying primarily on the dagster workers for parallelism.
Rely on dagster parallelism here.
Restrict processing to, say, ferc1_dbf or ferc2_xbrl dataset. This is intended for ci-integration parallelism.
Implemented …
@rousik there was an error because … Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the …
@zaneselvans can you provide some additional context? Do we expect to run this datasette metadata job as part of …
It seems to me that we could adapt the datasette tests so that they're configuration-aware and only test for the presence of files that should have been produced, i.e. skipping over datasets that are marked as disabled.
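For example, sketched with hypothetical fixtures and file names (not the actual test code):

```python
from pathlib import Path

import pytest

XBRL_DBS = {
    "ferc1": "ferc1_xbrl.sqlite",
    "ferc2": "ferc2_xbrl.sqlite",
}


@pytest.mark.parametrize("dataset", sorted(XBRL_DBS))
def test_xbrl_db_present(dataset, etl_settings, pudl_output_dir):
    """Only check for databases that the current configuration should have produced."""
    # `etl_settings.enabled_datasets` and `pudl_output_dir` are hypothetical fixtures.
    if dataset not in etl_settings.enabled_datasets:
        pytest.skip(f"{dataset} is disabled in this run's settings")
    assert (Path(pudl_output_dir) / XBRL_DBS[dataset]).exists()
```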
Overall looks pretty good! Thanks for sticking with this long-standing PR for so long 😅
I have a few clarifying questions to make sure the behavior is what we need before merging in.
Reworked the datasette tests after the recent changes there, to allow for the presence of …
Thanks for pushing this through @rousik ! Though there is still one merge conflict & a small typo re: a comment you had meant to remove.
There's definitely room for figuring out how we want to handle `--live-dbs`, and that is also way out of scope of this PR. If you have thoughts, @rousik, I'd love it if you could write them up in a Discussion!
Merged.
Hey @rousik, note that we set up the merge queue last week, and CI now only runs the integration tests in the merge queue.
src/pudl/ferc_to_sqlite/cli.py
default=2,
help=(
    "Number of worker processes to use when parsing XBRL filings. "
    "Defaults to using the number of CPUs."
Is the default here still supposed to be 2? If so the help message needs to be updated.
Let's keep the default as 0, which is what it was, and then force a concurrency of 2 in conftest.py. Also, we set `--workers 8` in `gcp_pudl_etl.sh`, so that is unaffected by this default.
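For example, a hypothetical argparse-style rendering of the reconciled option (the real CLI wiring in src/pudl/ferc_to_sqlite/cli.py may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="ferc_to_sqlite (sketch)")
parser.add_argument(
    "--workers",
    type=int,
    default=0,
    help=(
        "Number of worker processes to use when parsing XBRL filings. "
        "0 processes all filings in a single batch, i.e. one concurrent worker."
    ),
)
args = parser.parse_args()
```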
…lt to 1. It didn't default to num CPUs anyways - we passed in 0 as default before, which would eventually make the batch size equal to the total # of filings, i.e. 1 concurrent worker.
Fixed merge conflict & changed the default back to 0. Though I think that will actually throw a …
The presence of the monolithic `dbf2sqlite` and `xbrl2sqlite` ops that processed all datasets in a serial loop seemed like a missed opportunity. This change rips out those monoliths and replaces them with dataset-specific ops that are directly invoked from within the relevant graphs.

One thing that prevented an efficient breakup into smaller ops was the fact that these ops relied on configuration fields like `clobber` and `batch_size`, and this could fan out quite drastically. To prevent this, I have replaced the op-specific config with a `RuntimeSettings` resource that is defined once and used by all the ops that need access to these parameters.

This change:

- splits the `ferc_to_sqlite` graph into dataset-specific operations that are generated by the relevant extractor classes
- adds a `--dataset-only` flag that can further restrict the graph to a single dataset operation (this is going to be used in CI for sharding and speeding up `ferc_to_sqlite`); see the sketch below
- drops the `xbrl_enabled`/`dbf_enabled` flags in the job factory in favor of `op_selector` logic
- makes sure `dataset` and `data_format` are used in the op (so that they can be appropriately selected)
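Illustratively, the `--dataset-only` restriction could look something like this at job-construction time (hypothetical names, not necessarily how the PUDL factory implements its `op_selector` logic):

```python
from dagster import graph, op


def make_extract_op(dataset: str, data_format: str):
    @op(name=f"{dataset}_{data_format}")
    def extract() -> None:
        ...  # run the DBF or XBRL extractor for this one dataset

    return extract


def ferc_to_sqlite_job_factory(dataset_only: str | None = None):
    """Build a job with every per-dataset op, or just the one named by --dataset-only."""
    ops = {
        f"{ds}_{fmt}": make_extract_op(ds, fmt)
        for ds in ("ferc1", "ferc2")
        for fmt in ("dbf", "xbrl")
    }
    if dataset_only is not None:
        ops = {dataset_only: ops[dataset_only]}

    @graph(name="ferc_to_sqlite")
    def ferc_to_sqlite():
        for extract in ops.values():
            extract()

    return ferc_to_sqlite.to_job()
```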