Bridging our own ESG Data Gaps #30

Open
12 of 14 tasks
MichaelTiemannOSC opened this issue Nov 18, 2021 · 24 comments
Comments

@MichaelTiemannOSC
Contributor

MichaelTiemannOSC commented Nov 18, 2021

Global Warming Potential (GWP) is an essential metric for any portfolio decarbonization decision. There are many bottom-up datasets that contain information about CO2e emissions, but it is difficult to judge their completeness. The ESSD dataset (https://essd.copernicus.org/articles/13/5213/2021/essd-13-5213-2021-assets.html) provides a peer-reviewed, global estimate of all GWP emissions from 1970 to the present day, with attribution to sub-region and sub-sector levels, according to IPCC taxonomies. Three Excel spreadsheets comprising 37 MB (about 800K rows) of raw data provide a good sense of the total scale of the GWP problem.
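As a first look, a minimal sketch of pulling those spreadsheets into pandas; the file names below are placeholders for the actual ESSD spreadsheets staged under s3://redhat-osc-physical-landing-647521352890/ESSD/:

import pandas as pd

# Placeholder file names -- substitute the actual ESSD spreadsheet names.
essd_files = [
    "essd_ghg_data.xlsx",
    "essd_ghg_data_gwp100.xlsx",
    "essd_lulucf_data.xlsx",
]

# Concatenate the per-file sheets into one frame to get a first look at
# coverage by region, sector, and gas species.
frames = [pd.read_excel(f, sheet_name=0) for f in essd_files]
essd_df = pd.concat(frames, ignore_index=True)
print(essd_df.shape)  # on the order of 800K rows in total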

With that top-down starting point, we can onboard bottom-up data that fills in details about specific sources on a country or regional basis, and also on a company, industry, sector, or super-sector basis. For example, the PUDL data provides detailed emissions data for US utilities (including US territories and protectorates). Other datasets can be identified to fill gaps in other locations, across other industries, sectors, etc. But most importantly, with a top-down view, we will have some idea how large a gap a given dataset can fill in terms of GWP equations and analyses.

And of course, we want to tie any corporate ownership information to Legal Entity Identifiers, so that corporate hierarchies and connections with financial instruments can be understood.

A minimal satisfaction of this issue will therefore be:

  • onboarding of ESSD data (s3://redhat-osc-physical-landing-647521352890/ESSD/).
  • onboarding of PUDL data (relatively simple if we "federate at the Dataframe layer") (s3://redhat-osc-physical-landing-647521352890/PUDL/pudl-v0.5.0/).
  • enrichment of PUDL data with GLEIF data (currently matching 1215 of 3334 entities in their database; see the name-matching sketch after this list).
  • connection between PUDL data and ESSD data (consistent country/region information, gas species information, etc).
  • go beyond powerplant data (esp. for Physical Risk) by adding steel and cement plant data: s3://redhat-osc-physical-landing-647521352890/SFI_GeoAsset/
  • onboarding of SEC DERA data (public company filings that include stock ticker and SIC code info; s3://redhat-osc-physical-landing-647521352890/SEC-DERA/).
  • connection between DERA data and GLEIF.
  • use of EPA National Combined data to link Entity Names with NAICS and SIC codes (trivially captures industry classification of major emitters like Utilities, Steel and Cement, Transportation, etc.). DERA data only covers publicly traded companies; EPA data captures and codes many private companies (who affect the environment). See s3://redhat-osc-physical-landing-647521352890/EPA/national_combined-20211104/.
  • Onboard GHGRP data (s3://redhat-osc-physical-landing-647521352890/EPA/ghgp_data_parent_company_10_2021.xlsx and others).
  • Connection of EPA GHGRP and SEC DERA data to each other and with GLEIF to map LEIs to industry/sector data.
  • Connection of EPA FRS and SEC DERA data to each other via EIN (!)
  • Upload WorldClim 10m dataset s3://redhat-osc-physical-landing-647521352890/WorldClim/wc2.1_10m/
  • Onboard 2017 US Census all-sector data (s3://redhat-osc-physical-landing-647521352890/US_CENSUS/US_Census-All-Sector-Report-2017/)
  • Align with ISO 3166-2 using https://github.com/esosedi/3166
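As referenced in the GLEIF enrichment item above, a minimal name-matching sketch (stdlib difflib only; the input file names and column names are placeholders, and a real matcher would normalize legal-entity suffixes such as Inc./LLC before comparing):

import difflib
import pandas as pd

# Placeholder inputs: entity names extracted from PUDL, and a GLEIF
# golden-copy extract with LEI and LegalName columns.
pudl_names = pd.read_csv("pudl_entity_names.csv")["entity_name"]
gleif = pd.read_csv("gleif_extract.csv")  # columns: LEI, LegalName

choices = gleif["LegalName"].tolist()

def best_lei(name, cutoff=0.9):
    """Return the LEI of the closest GLEIF legal name, or None if no match."""
    match = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    if not match:
        return None
    return gleif.loc[gleif["LegalName"] == match[0], "LEI"].iloc[0]

lei_map = {name: best_lei(name) for name in pudl_names}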

It would be wonderful to ultimately be able to produce maps like this from our data: https://openghgmap.net/
(see also here: https://openghgmap.net/data/)

Should we also add climate data to the grid so that we can correlate temperature/precipitation/wind speed, etc., with macro-level effects like population change, energy use, etc.? Sources referenced from this survey paper (https://www.nature.com/articles/s41597-020-00726-5) include WorldClim 2.1 (https://www.worldclim.org/data/worldclim21.html), which provides data from 1970-2000 and is the most cited source in the literature. WorldClim also offers monthly data that is more current (up through 2018), but only for a few variables (temp_min, temp_max, precipitation).
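If we do pull in WorldClim, the 10-minute bundle under s3://redhat-osc-physical-landing-647521352890/WorldClim/wc2.1_10m/ is distributed as GeoTIFFs (one per variable per month), so reading it is straightforward; a sketch assuming rasterio is available (the file name is illustrative):

import rasterio

# Illustrative file name: January average temperature at 10-minute resolution.
with rasterio.open("wc2.1_10m_tavg_01.tif") as src:
    tavg_jan = src.read(1)               # 2-D array of grid-cell values
    print(src.crs, src.res, tavg_jan.shape)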

@MichaelTiemannOSC
Contributor Author

Created repository and committed changes for https://github.com/os-climate/essd-ingest-pipeline

@MichaelTiemannOSC
Contributor Author

@caldeirav @erikerlandson I still don't feel confident about the metadata I'm collecting and storing. I would love to have your eyes on it and your comments. Better documentation would lead to greater confidence, probably for many people.

@erikerlandson
Contributor

Regarding UUID as a column:

  1. if we set up uuid as one of the partition columns, it will not be stored as an actual column in the underlying Parquet or ORC files.
  2. even if we do not use it as a partition column, the columnar compression in these formats should achieve very high compression ratios, since the value is constant across many rows. This will be particularly true if we make sure it can leverage delta compression
    https://github.com/apache/parquet-format/blob/master/Encodings.md
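A small local illustration of both points using pyarrow (Trino/Iceberg does the equivalent server-side); the uuid value is a placeholder:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "uuid": ["4f3c-placeholder"] * 1000,  # constant across the ingest batch
    "value": range(1000),
})
table = pa.Table.from_pandas(df)

# Hive-style partitioning on uuid: the value lives only in the directory
# name (out/uuid=4f3c-placeholder/...) and is dropped from the file itself.
pq.write_to_dataset(table, root_path="out", partition_cols=["uuid"])

# Written as an ordinary column instead, a constant value dictionary/RLE
# encodes down to almost nothing, so the per-row overhead is small either way.
pq.write_table(table, "out_plain.parquet")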

@MichaelTiemannOSC
Contributor Author

So it looks to me like I'm not storing UUID as a column, but rather as part of the parquet naming structure, hence it's a partition column. So I should probably change my comment from a question ("Do we really want to add UUID to every row, or better to put into table-level metadata?") to a statement ("Use UUID as one of our partition columns") and make it regular, non-bold text.

@MichaelTiemannOSC
Contributor Author

Regarding the onboarding of PUDL data, we have a bit of a strategic decision to make. What the PUDL environment provides is a convenient way for data scientists to easily access utility info in a very pandas-friendly way (using their own SQLAlchemy connector to SQLite databases). It also creates an abstraction layer so users don't have to think about which data is in SQLite (all the plant/utility/generator/generation/fuel data) and which is in Parquet (all the emissions data). The pudl.sqlite database may be something we connect to Trino via our own SQLAlchemy connector. We could of course explode the pudl.sqlite database into its constituent tables and create our own rendition of the data in Trino tables. The Parquet files are of course very easy to integrate.

But...what sort of centralization and consistency do we want to impose on data sources in terms of running through Trino vs a concept of federation that could be as wide open as saying pip install pudl (they highly encourage using conda) and filling one's dataframes from a data source that draws from the Data Commons but not from Trino per se?
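For comparison, filling dataframes straight from the PUDL SQLite release is only a couple of lines, which is what makes the "federate at the Dataframe layer" option attractive; a sketch, with an illustrative table name:

import pandas as pd
import sqlalchemy as sa

# Path to pudl.sqlite from the PUDL release bundle.
engine = sa.create_engine("sqlite:///pudl.sqlite")

# Illustrative table name -- enumerate the real ones with
# sa.inspect(engine).get_table_names().
plants = pd.read_sql("SELECT * FROM plants_entity_eia", engine)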

@erikerlandson
Contributor

Connecting Trino directly to SQLite would be nice. I think if they ever get a generic JDBC connector working, this will become easy:
trinodb/trino#3105

Failing that, my concern with just exposing a deployment of pip install pudl is that it is higher friction to federate with anything on trino, although one could do the federation in "pandas space"

When we meet with the PUDL people, I'd like to discuss supporting some other strategic DBs, and ideally a single unified SQL interface to their entire set (for example, MySQL, PostgreSQL, or BigQuery might be interesting, and all are already supported by Trino). In such a scenario, we could run PUDL as a microservice and configure a Trino connector to it.

@erikerlandson
Contributor

We could certainly write an ingest pipeline that consumes both their Parquet and SQLite outputs and injects both into Trino. This seems less elegant, but it would be effective.

@MichaelTiemannOSC
Contributor Author

MichaelTiemannOSC commented Nov 20, 2021

usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.

https://golangrepo.com/repo/xo-usql-go-database-tools

@caldeirav
Contributor

On the usage of UUID: in my latest ingestion demo sample (leveraging Iceberg), UUID is used for partitioning data at the dataset level. Design-wise I feel it makes sense because we can only do deletes based on partition elements. I could have used a timestamp as well, but the timestamp is already captured by Iceberg in the snapshot table and may not be consistent across all INSERTs into the table.

@MichaelTiemannOSC
Contributor Author

To wit...

# Derive "column-name type" pairs from the DataFrame's dtypes
schema = create_table_schema_pairs(df)

# Iceberg table definition, partitioned on the per-ingest uuid column
tabledef = """
create table if not exists osc_datacommons_iceberg_dev.{sname}.{tname} (
{schema}
) with (
    format = 'parquet',
    partitioning = ARRAY['uuid']
)
""".format(schema=schema, sname=schemaname, tname=tablename)
print(tabledef)
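(A sketch of how that definition might be executed and how a dataset could later be removed via its uuid partition; the engine URL, connection details, and uuid value are placeholders:)

import sqlalchemy as sa

engine = sa.create_engine("trino://user@trino-host:443/osc_datacommons_iceberg_dev")
with engine.connect() as conn:
    conn.execute(sa.text(tabledef))
    # Iceberg lets us drop everything ingested under one uuid with a single
    # partition-level delete, which is the point of partitioning on it.
    conn.execute(sa.text(
        "delete from {sname}.{tname} where uuid = '4f3c-placeholder'".format(
            sname=schemaname, tname=tablename)
    ))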

Thanks!

@erikerlandson
Contributor

usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.

I do not understand why the trino dev community has not given a generalized JDBC connector higher priority, since it would allow trino to connect to practically any db that exposes an SQL interface, probably including other trino dbs.

@MichaelTiemannOSC
Contributor Author

What they explain about halfway through the issue you posted is that they don't believe general solutions can be adequately tested. Instead, they put their faith in rigorous testing of specific connectors. Now...why there is no support for (super-popular) SQLite is a question they don't answer.

@MichaelTiemannOSC
Contributor Author

@HeatherAck -- your encouraging response led me to flesh this out a bit more, and I think it's turning into a project, but a very doable one, since all the data has been uploaded. It thus may truly qualify as a worthy sprint. As elaborated above, these datasets connect both with GLEIF and with each other, and by making these connections we are well on our way to having enriched data. For example, a gap in the current RMI data is "Enterprise Value + Cash", but these are numbers that can be pulled from the SEC DERA data for publicly traded companies. Indeed, we now have richly described public financial data to go along with granular environmental data within a context of global emissions (and potentially climate) data.

If we could get this in shape by Dec 1st, it will give the Data Commons gigabytes of heft.
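To make the "Enterprise Value + Cash" point concrete, a sketch of pulling one such figure out of the DERA Financial Statement Data Sets, assuming the standard layout in which num.txt holds one XBRL fact per row (adsh, tag, ddate, uom, value, ...) and sub.txt maps each filing (adsh) to its registrant; the exact set of US-GAAP tags to use would need to be agreed:

import pandas as pd

num = pd.read_csv("num.txt", sep="\t", dtype=str)
sub = pd.read_csv("sub.txt", sep="\t", dtype=str)

# Example US-GAAP tag for cash; other tags would be needed to assemble
# a full Enterprise Value + Cash figure.
cash = num[num["tag"] == "CashAndCashEquivalentsAtCarryingValue"]
cash = cash.merge(sub[["adsh", "cik", "name"]], on="adsh")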

@caldeirav
Contributor

Couple of notes here:

  • The ingestion model via the Iceberg connector is, in my view, more scalable, as it does not require file / file-structure handling, which can lead to inconsistencies. Also, we could easily build our own libraries to facilitate ingestion via standardised SQL-based operations (though ideally the dataframe.to_sql functions would be preferred).
  • However, as we start to really industrialise data ingestion, we need to make a decision soon, as data would likely have to be re-processed if we switch along the way (migration is cumbersome).

@caldeirav
Contributor

Also, I am starting a small POC now for metadata handling on the visualisation / client side. I need to complete this before finalising metadata handling and structure.

@MichaelTiemannOSC
Contributor Author

Here's an annotated picture showing both how things fit together and also a sense of how many records we are attempting to fit together. Note that the 3K owners referenced in PUDL do not scratch the surface of the millions of rows of emissions data also in the PUDL database, for example.

[Diagram: Data Commons November Sprint]

@ludans @DavWinkel @ChristianMeyndt @ttschmucker @toki8

@erikerlandson
Contributor

xref trinodb/trino#10060 for SQLite connector support

@MichaelTiemannOSC
Contributor Author

MichaelTiemannOSC commented Nov 24, 2021

Just checking in with the fact that my LEI matcher matches 3966 of the 5866 DERA companies, a 68% match rate. There are about 1890 that don't yet match, but the ones I hand-checked have no matches on the GLEIF website, so I suspect we are suffering from a lack of LEI coverage.

(I previously reported much larger numbers in all categories, but this is because I forgot to run sort prior to uniq when generating my input to the matcher.)

@MichaelTiemannOSC
Contributor Author

Alas, PUDL requires SQLAlchemy version 1.4 to do some of its things (it can do a few things fine without it). But our Trino environment requires SQLAlchemy version 1.3 in order to function at all. So PUDL and Trino are not good bunkmates right now.

@MichaelTiemannOSC
Contributor Author

I've checked in a demo notebook that begins to use the connections possible in the above diagram. I've also left a comment in the code that there is a LONG way to go before queries that seem simple are simple.

https://github.com/os-climate/data-platform-demo/blob/master/notebooks/EPA_GHGRP-demos.ipynb

@MichaelTiemannOSC
Contributor Author

PUDL data integration may depend on this: catalyst-cooperative/pudl#1361

@erikerlandson
Contributor

I just ran some operations using the latest sqlalchemy and sqlalchemy-trino (pip install --upgrade ...), and it worked. This includes sqlalchemy==1.4.27 and sqlalchemy-trino==0.4.1

@MichaelTiemannOSC
Contributor Author

That's cheating 8-)

We do still have the question about onboarding data more deeply into the Data Commons.

@MichaelTiemannOSC
Contributor Author

Added ISO3166 data by uploading PyCountry data to Trino in a new data pipeline: https://github.com/os-climate/iso3166-ingest
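For reference, a minimal sketch of what that pipeline does conceptually: turning PyCountry's ISO 3166-1 and ISO 3166-2 records into DataFrames before loading them into Trino (the column selection is illustrative):

import pandas as pd
import pycountry

# ISO 3166-1 country codes
countries = pd.DataFrame(
    [{"alpha_2": c.alpha_2, "alpha_3": c.alpha_3,
      "numeric": c.numeric, "name": c.name}
     for c in pycountry.countries]
)

# ISO 3166-2 subdivisions (states, provinces, ...)
subdivisions = pd.DataFrame(
    [{"code": s.code, "name": s.name, "type": s.type,
      "country_code": s.country_code}
     for s in pycountry.subdivisions]
)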
