Add metadata for ATB, EIA 930 and AEO data #3474
I could be wrong but I think the working_partitions are only supposed to list those partitions which we are actually extracting and transforming and expect to work, which in the case of NREL ATB and EIA AEO I think will only be the most recent release.
src/pudl/metadata/sources.py
Outdated
"license_raw": LICENSES["us-govt"],
"license_pudl": LICENSES["cc-by-4.0"],
},
"eia_aeo": {
This ID is going to end up everywhere. Do we want to use eia_aeo or eiaaeo? Personally I think eia_bulk_elec isn't great because it's different from all of the other data source IDs we use, which are a single alphanumeric string like eia923, so this seems like a different format. Similar question with the nrel_atb below.
I find this more legible but I agree it's less consistent with how we've done things thus far. I'll drop the space.
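The naming question above can be made concrete with a small check. This is only a sketch: the pattern encodes the "single alphanumeric string" convention as described in the comment, and `SINGLE_TOKEN_ID` and `is_single_token` are hypothetical names, not part of PUDL.

```python
import re

# Assumed convention, inferred from the review comment: an ID is one run of
# lowercase letters and digits, with no underscores or other separators.
SINGLE_TOKEN_ID = re.compile(r"[a-z][a-z0-9]*")


def is_single_token(source_id: str) -> bool:
    """Return True if the ID is a single lowercase alphanumeric string."""
    return SINGLE_TOKEN_ID.fullmatch(source_id) is not None


for candidate in ["eia923", "eiaaeo", "eia_aeo", "nrel_atb", "eia_bulk_elec"]:
    print(candidate, is_single_token(candidate))
```

Under this assumed pattern, eia923 and eiaaeo pass, while eia_aeo, nrel_atb, and eia_bulk_elec do not.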
src/pudl/metadata/sources.py
Outdated
},
"field_namespace": "nrel_atb",
"working_partitions": {
"years": sorted(set(range(2015, 2024))),
I only saw 2019-2023 in their cloud buckets as Parquet files. The older data is in a variety of different formats (like spreadsheets) and would have to be downloaded from other locations. I could see potentially archiving it, but I doubt we'll want to do any transforms. Given the tight hours we've got, I think we should just stick to archiving all the data in the current format for now and integrate only the most recent year of data initially, which I think means we just want years: [2023] for the working_partitions, right?
aws s3 ls --no-sign-request s3://oedi-data-lake/ATB/electricity/parquet/
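To illustrate the distinction being drawn here between what is archived and what is listed as a working partition, here is a minimal sketch. `ARCHIVED_YEARS` and `latest_only` are hypothetical names used only for illustration. The year range reflects the 2019-2023 Parquet files mentioned in the comment, not a verified bucket listing.

```python
# Hypothetical illustration; none of these names come from PUDL itself.
# Years reported in the comment as available in Parquet format (2019-2023).
ARCHIVED_YEARS = list(range(2019, 2024))


def latest_only(years):
    """Return a working_partitions-style dict holding only the newest year."""
    return {"years": [max(years)]}


working_partitions = latest_only(ARCHIVED_YEARS)
print(working_partitions)  # {'years': [2023]}
```

All five years could still be archived. Only the partition expected to run through the ETL would appear under working_partitions.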
src/pudl/metadata/sources.py
Outdated
},
"field_namespace": "eia",
"working_partitions": {
"years": sorted(set(range(2014, 2024))),
For the moment I think we'll probably only be extracting and working with the most recent (2023) data, although we can archive all of them. But IIRC the working_partitions are the ones that are supposed to be ETL-able.
src/pudl/metadata/sources.py
Outdated
@@ -607,6 +691,41 @@
"license_raw": LICENSES["us-govt"],
"license_pudl": LICENSES["cc-by-4.0"],
},
"nrel_atb": {
"title": "NREL Annual Technology Baseline (ATB)",
NREL publishes an ATB for both Electricity and Transportation. It seems unlikely that we'll ever be working with the transportation data, but maybe it's worth noting that we're talking about the electricity data in the title and description.
- "title": "NREL Annual Technology Baseline (ATB)",
+ "title": "NREL Annual Technology Baseline (ATB) for Electricity",
src/pudl/metadata/sources.py
Outdated
"title": "NREL Annual Technology Baseline (ATB)",
"path": "https://atb.nrel.gov/",
"description": (
"The NREL Annual Technology Baseline (ATB) publishes annual projections of "
- "The NREL Annual Technology Baseline (ATB) publishes annual projections of "
+ "The NREL Annual Technology Baseline (ATB) for Electricity publishes annual projections of "
src/pudl/metadata/sources.py
Outdated
"half_year": [
f"{str(q).lower()}h{half}"
for q in pd.period_range(start="2015", end="2023", freq="Y")
for half in [1, 2]
][1:-1] # Begins in H2 of 2015 and currently ends in H1 of 2024
I think you meant to include all of 2023 and the first half of 2024, right? Sorry I didn't catch this before.
- "half_year": [
- f"{str(q).lower()}h{half}"
- for q in pd.period_range(start="2015", end="2023", freq="Y")
- for half in [1, 2]
- ][1:-1] # Begins in H2 of 2015 and currently ends in H1 of 2024
+ "half_year": [
+ f"{year}h{half}" for year in range(2015, 2025) for half in [1, 2]
+ ][1:-1] # Begins in H2 of 2015 and currently ends in H1 of 2024
Oops yes good catch!
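The off-by-one-year issue in this thread is easy to see by printing the endpoints of both lists. The sketch below uses plain Python for both cases. It assumes the original pandas period range yields one period per year from 2015 through 2023 inclusive. `half_years` is a hypothetical helper written only for this comparison.

```python
def half_years(first_year, last_year):
    """Build labels like '2015h2', trimming the first H1 and the last H2."""
    labels = [
        f"{year}h{half}"
        for year in range(first_year, last_year + 1)
        for half in [1, 2]
    ]
    return labels[1:-1]


# Same years as pd.period_range(start="2015", end="2023", freq="Y").
original = half_years(2015, 2023)
# Same years as range(2015, 2025) in the suggested change.
suggested = half_years(2015, 2024)

print(original[0], original[-1])    # 2015h2 2023h1
print(suggested[0], suggested[-1])  # 2015h2 2024h1
```

Because the slice drops the final element, the year range has to run through 2024 for the list to end at 2024h1, which is what the code comment describes.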
Overview
Closes #3473.
What problem does this address?
Adds new data sources to pudl.metadata.sources.py to enable archiving.
What did you change?
Added new datasets to our sources.
Testing
How did you make sure this worked? How can a reviewer verify this?
Review existing docs and check links to make sure all information is correct.
To-do lis