Bridging our own ESG Data Gaps #30

Open
12 of 14 tasks
MichaelTiemannOSC opened this issue Nov 18, 2021 · 24 comments
Comments

@MichaelTiemannOSC
Contributor

MichaelTiemannOSC commented Nov 18, 2021

Global Warming Potential (GWP) is an essential metric for any portfolio decarbonization decision. There are many bottom-up datasets that contain information about CO2e emissions, but it is difficult to judge their completeness. The ESSD dataset (https://essd.copernicus.org/articles/13/5213/2021/essd-13-5213-2021-assets.html) provides a peer-reviewed, global estimate of all GWP emissions from 1970 to the present day, with attribution to sub-region and sub-sector levels, according to IPCC taxonomies. Three Excel spreadsheets comprising 37 MB (about 800K rows) of raw data provide a good sense of the total scale of the GWP problem.
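As a first look, a minimal sketch of pulling those spreadsheets into pandas; the file names below are placeholders for the actual ESSD spreadsheets staged under s3://redhat-osc-physical-landing-647521352890/ESSD/:

import pandas as pd

# Placeholder file names -- substitute the actual ESSD spreadsheet names.
essd_files = [
    "essd_ghg_data.xlsx",
    "essd_ghg_data_gwp100.xlsx",
    "essd_lulucf_data.xlsx",
]

# Concatenate the per-file sheets into one frame to get a first look at
# coverage by region, sector, and gas species.
frames = [pd.read_excel(f, sheet_name=0) for f in essd_files]
essd_df = pd.concat(frames, ignore_index=True)
print(essd_df.shape)  # on the order of 800K rows in total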

With that top-down starting point, we can onboard bottom-up data that fills in details about specific sources on a country or regional basis, and also on a company, industry, sector, or super-sector basis. For example, the PUDL data provides detailed emissions data for US utilities (including US territories and protectorates). Other datasets can be identified to fill gaps in other locations, across other industries, sectors, etc. But most importantly, with a top-down view, we will have some idea how large a gap a given dataset can fill in terms of GWP equations and analyses.

And of course, we want to tie any corporate ownership information to Legal Entity Identifiers, so that corporate hierarchies and connections with financial instruments can be understood.

A minimal satisfaction of this issue will therefore be:

  • onboarding of ESSD data (s3://redhat-osc-physical-landing-647521352890/ESSD/).
  • onboarding of PUDL data (relatively simple if we "federate at the Dataframe layer") (s3://redhat-osc-physical-landing-647521352890/PUDL/pudl-v0.5.0/).
  • enrichment of PUDL data with GLEIF data (currently matching 1215 of 3334 entities in their database; see the name-matching sketch after this list).
  • connection between PUDL data and ESSD data (consistent country/region information, gas species information, etc).
  • go beyond powerplant data (esp. for Physical Risk) by adding steel and cement plant data: s3://redhat-osc-physical-landing-647521352890/SFI_GeoAsset/
  • onboarding of SEC DERA data (public company filings that include stock ticker and SIC code info; s3://redhat-osc-physical-landing-647521352890/SEC-DERA/).
  • connection between DERA data and GLEIF.
  • use of EPA National Combined data to link Entity Names with NAICS and SIC codes (trivially captures industry classification of major emitters like Utilities, Steel and Cement, Transportation, etc.). DERA data only covers publicly traded companies; EPA data captures and codes many private companies (who affect the environment). See s3://redhat-osc-physical-landing-647521352890/EPA/national_combined-20211104/.
  • Onboard GHGRP data (s3://redhat-osc-physical-landing-647521352890/EPA/ghgp_data_parent_company_10_2021.xlsx and others).
  • Connection of EPA GHGRP and SEC DERA data to each other and with GLEIF to map LEIs to industry/sector data.
  • Connection of EPA FRS and SEC DERA data to each other via EIN (!)
  • Upload WorldClim 10m dataset s3://redhat-osc-physical-landing-647521352890/WorldClim/wc2.1_10m/
  • Onboard 2017 US Census all-sector data (s3://redhat-osc-physical-landing-647521352890/US_CENSUS/US_Census-All-Sector-Report-2017/)
  • Align with ISO 3166-2 using https://github.com/esosedi/3166
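As referenced in the GLEIF enrichment item above, a minimal name-matching sketch (stdlib difflib only; the input file names and column names are placeholders, and a real matcher would normalize legal-entity suffixes such as Inc./LLC before comparing):

import difflib
import pandas as pd

# Placeholder inputs: entity names extracted from PUDL, and a GLEIF
# golden-copy extract with LEI and LegalName columns.
pudl_names = pd.read_csv("pudl_entity_names.csv")["entity_name"]
gleif = pd.read_csv("gleif_extract.csv")  # columns: LEI, LegalName

choices = gleif["LegalName"].tolist()

def best_lei(name, cutoff=0.9):
    """Return the LEI of the closest GLEIF legal name, or None if no match."""
    match = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    if not match:
        return None
    return gleif.loc[gleif["LegalName"] == match[0], "LEI"].iloc[0]

lei_map = {name: best_lei(name) for name in pudl_names}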

It would be wonderful to ultimately be able to produce maps like this from our data: https://openghgmap.net/
(see also here: https://openghgmap.net/data/)

Should we also add climate data to the grid so that we can correlate temperature/precipitation/wind speed, etc., with macro-level effects like population change, energy use, etc.? Sources referenced from this survey paper (https://www.nature.com/articles/s41597-020-00726-5) include WorldClim 2.1 (https://www.worldclim.org/data/worldclim21.html), which provides data from 1970-2000 and is the most cited source in the literature. WorldClim also offers monthly data that is more current (up through 2018), but only for a few variables (temp_min, temp_max, precipitation).
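If we do pull in WorldClim, the 10-minute bundle under s3://redhat-osc-physical-landing-647521352890/WorldClim/wc2.1_10m/ is distributed as GeoTIFFs (one per variable per month), so reading it is straightforward; a sketch assuming rasterio is available (the file name is illustrative):

import rasterio

# Illustrative file name: January average temperature at 10-minute resolution.
with rasterio.open("wc2.1_10m_tavg_01.tif") as src:
    tavg_jan = src.read(1)               # 2-D array of grid-cell values
    print(src.crs, src.res, tavg_jan.shape)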

@MichaelTiemannOSC
Contributor Author

Created repository and committed changes for https://github.com/os-climate/essd-ingest-pipeline

@MichaelTiemannOSC
Contributor Author

@caldeirav @erikerlandson I still don't feel confident about the metadata I'm collecting and storing. I would love to have your eyes on it and your comments. Better documentation would lead to greater confidence, probably for many people.

@erikerlandson
Contributor

Regarding UUID as a column:

  1. if we set up uuid as one of the partition columns, it will not be stored as an actual column in the underlying Parquet or ORC files.
  2. even if we do not use it as a partition column, the columnar compression in these formats should achieve very high compression ratios, since the value is constant across many rows. This will be particularly true if we make sure it can leverage delta compression
    https://github.com/apache/parquet-format/blob/master/Encodings.md
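A small local illustration of both points using pyarrow (Trino/Iceberg does the equivalent server-side); the uuid value is a placeholder:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "uuid": ["4f3c-placeholder"] * 1000,  # constant across the ingest batch
    "value": range(1000),
})
table = pa.Table.from_pandas(df)

# Hive-style partitioning on uuid: the value lives only in the directory
# name (out/uuid=4f3c-placeholder/...) and is dropped from the file itself.
pq.write_to_dataset(table, root_path="out", partition_cols=["uuid"])

# Written as an ordinary column instead, a constant value dictionary/RLE
# encodes down to almost nothing, so the per-row overhead is small either way.
pq.write_table(table, "out_plain.parquet")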

@MichaelTiemannOSC
Contributor Author

So it looks to me like I'm not storing UUID as a column, but rather as part of the parquet naming structure, hence it's a partition column. So I should probably change my comment from a question ("Do we really want to add UUID to every row, or better to put into table-level metadata?") to a statement ("Use UUID as one of our partition columns") and make it regular, non-bold text.

@MichaelTiemannOSC
Contributor Author

Regarding the onboarding of PUDL data, we have a bit of a strategic decision to make. What the PUDL environment provides is a convenient way for data scientists to easily access utility info in a very pandas-friendly way (using their own SQLAlchemy connector to SQLite databases). It also creates an abstraction layer so users don't have to think about which data is in SQLite (all the plant/utility/generator/generation/fuel data) and which is in Parquet (all the emissions data). The pudl.sqlite database may be something we connect to Trino via our own SQLAlchemy connector. We could of course explode the pudl.sqlite database into its constituent tables and create our own rendition of the data in Trino tables. The Parquet files are of course very easy to integrate.

But...what sort of centralization and consistency do we want to impose on data sources in terms of running through Trino vs a concept of federation that could be as wide open as saying pip install pudl (they highly encourage using conda) and filling one's dataframes from a data source that draws from the Data Commons but not from Trino per se?
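For comparison, filling dataframes straight from the PUDL SQLite release is only a couple of lines, which is what makes the "federate at the Dataframe layer" option attractive; a sketch, with an illustrative table name:

import pandas as pd
import sqlalchemy as sa

# Path to pudl.sqlite from the PUDL release bundle.
engine = sa.create_engine("sqlite:///pudl.sqlite")

# Illustrative table name -- enumerate the real ones with
# sa.inspect(engine).get_table_names().
plants = pd.read_sql("SELECT * FROM plants_entity_eia", engine)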

@erikerlandson
Contributor

Connecting Trino directly to SQLite would be nice. I think if they ever get a generic JDBC connector working, this will become easy:
trinodb/trino#3105

Failing that, my concern with just exposing a deployment of pip install pudl is that it is higher friction to federate with anything on trino, although one could do the federation in "pandas space"

When we meet with the PUDL people, I'd like to discuss supporting some other strategic DBs, and ideally a single unified SQL interface to their entire set (for example, MySQL, PostgreSQL, or BigQuery might be interesting, and all are already supported by Trino). In such a scenario, we could run PUDL as a microservice and configure a Trino connector to it.

@erikerlandson
Contributor

We could certainly write an ingest pipeline that consumes both their Parquet and SQLite outputs and injects both into Trino. This seems less elegant, but it would be effective.

@MichaelTiemannOSC
Contributor Author

MichaelTiemannOSC commented Nov 20, 2021

usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.

https://golangrepo.com/repo/xo-usql-go-database-tools

@caldeirav
Contributor

On the usage of UUID: in my latest ingestion demo sample (leveraging Iceberg), UUID is used for partitioning data at the dataset level. Design-wise I feel it makes sense because we can only do deletes based on partition elements. I could have used a timestamp as well, but the timestamp is already captured by Iceberg in the snapshot table and may not be consistent across all INSERTs into the table.

@MichaelTiemannOSC
Contributor Author

To wit...

# Derive "column-name type" pairs from the DataFrame's dtypes
schema = create_table_schema_pairs(df)

# Iceberg table definition, partitioned on the per-ingest uuid column
tabledef = """
create table if not exists osc_datacommons_iceberg_dev.{sname}.{tname} (
{schema}
) with (
    format = 'parquet',
    partitioning = ARRAY['uuid']
)
""".format(schema=schema, sname=schemaname, tname=tablename)
print(tabledef)
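(A sketch of how that definition might be executed and how a dataset could later be removed via its uuid partition; the engine URL, connection details, and uuid value are placeholders:)

import sqlalchemy as sa

engine = sa.create_engine("trino://user@trino-host:443/osc_datacommons_iceberg_dev")
with engine.connect() as conn:
    conn.execute(sa.text(tabledef))
    # Iceberg lets us drop everything ingested under one uuid with a single
    # partition-level delete, which is the point of partitioning on it.
    conn.execute(sa.text(
        "delete from {sname}.{tname} where uuid = '4f3c-placeholder'".format(
            sname=schemaname, tname=tablename)
    ))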

Thanks!

@erikerlandson
Contributor

usql seems to be able to bridge from mysql to trino. Wonder why Trino cannot do it directly.

I do not understand why the trino dev community has not given a generalized JDBC connector higher priority, since it would allow trino to connect to practically any db that exposes an SQL interface, probably including other trino dbs.

@MichaelTiemannOSC
Contributor Author

What they explain about halfway through the issue you posted is that they don't believe general solutions can be adequately tested. Instead, they put their faith in rigorous testing of specific connectors. Now...why there is no support for (super-popular) SQLite is a question they don't answer.

@MichaelTiemannOSC
Contributor Author

@HeatherAck -- your encouraging response led me to flesh this out a bit more, and I think it's turning into a project, but a very doable one, since all the data has been uploaded. It thus may truly qualify as a worthy sprint. As elaborated above, these datasets connect both with GLEIF and with each other, and by making these connections we are well on our way to having enriched data. For example, a gap in the current RMI data is "Enterprise Value + Cash", but these are numbers that can be pulled from the SEC DERA data for publicly traded companies. Indeed, we now have richly described public financial data to go along with granular environmental data within a context of global emissions (and potentially climate) data.

If we could get this in shape by Dec 1st, it will give the Data Commons gigabytes of heft.
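To make the "Enterprise Value + Cash" point concrete, a sketch of pulling one such figure out of the DERA Financial Statement Data Sets, assuming the standard layout in which num.txt holds one XBRL fact per row (adsh, tag, ddate, uom, value, ...) and sub.txt maps each filing (adsh) to its registrant; the exact set of US-GAAP tags to use would need to be agreed:

import pandas as pd

num = pd.read_csv("num.txt", sep="\t", dtype=str)
sub = pd.read_csv("sub.txt", sep="\t", dtype=str)

# Example US-GAAP tag for cash; other tags would be needed to assemble
# a full Enterprise Value + Cash figure.
cash = num[num["tag"] == "CashAndCashEquivalentsAtCarryingValue"]
cash = cash.merge(sub[["adsh", "cik", "name"]], on="adsh")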

@caldeirav
Contributor

Couple of notes here:

  • The ingestion model via the Iceberg connector is, in my view, more scalable, as it does not require file / file-structure handling, which can lead to inconsistencies. Also, we could easily build our own libraries to facilitate ingestion via standardised SQL-based operations (though ideally the dataframe.to_sql functions would be preferred).
  • However, as we start to really industrialise data ingestion, we need to make a decision soon, as data would likely have to be re-processed if we switch along the way (migration is cumbersome).

@caldeirav
Contributor

Also, I am starting a small POC now for metadata handling on the visualisation / client side. I need to complete this before finalising metadata handling and structure.

@MichaelTiemannOSC
Contributor Author

Here's an annotated picture showing both how things fit together and also a sense of how many records we are attempting to fit together. Note that the 3K owners referenced in PUDL do not scratch the surface of the millions of rows of emissions data also in the PUDL database, for example.

[Diagram: Data Commons November Sprint]

@ludans @DavWinkel @ChristianMeyndt @ttschmucker @toki8

@erikerlandson
Contributor

xref trinodb/trino#10060 for SQLite connector support

@MichaelTiemannOSC
Contributor Author

MichaelTiemannOSC commented Nov 24, 2021

Just checking in with the fact that my LEI matcher matches 3966 of the 5866 DERA companies, a 68% match rate. There are about 1890 that don't yet match, but the ones I hand-checked have no matches on the GLEIF website, so I suspect we are suffering from a lack of LEI coverage.

(I previously reported much larger numbers in all categories, but this is because I forgot to run sort prior to uniq when generating my input to the matcher.)

@MichaelTiemannOSC
Contributor Author

Alas, PUDL requires SQLAlchemy version 1.4 to do some of its things (it can do a few things fine without it). But our Trino environment requires SQLAlchemy version 1.3 in order to function at all. So PUDL and Trino are not good bunkmates right now.

@MichaelTiemannOSC
Contributor Author

I've checked in a demo notebook that begins to use the connections possible in the above diagram. I've also left a comment in the code that there is a LONG way to go before queries that seem simple are simple.

https://github.com/os-climate/data-platform-demo/blob/master/notebooks/EPA_GHGRP-demos.ipynb

@MichaelTiemannOSC
Contributor Author

PUDL data integration may depend on this: catalyst-cooperative/pudl#1361

@erikerlandson
Contributor

I just ran some operations using the latest sqlalchemy and sqlalchemy-trino (pip install --upgrade ...), and it worked. This includes sqlalchemy==1.4.27 and sqlalchemy-trino==0.4.1

@MichaelTiemannOSC
Contributor Author

That's cheating 8-)

We do still have the question about onboarding data more deeply into the Data Commons.

@MichaelTiemannOSC
Contributor Author

Added ISO3166 data by uploading PyCountry data to Trino in a new data pipeline: https://github.com/os-climate/iso3166-ingest
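For reference, a minimal sketch of what that pipeline does conceptually: turning PyCountry's ISO 3166-1 and ISO 3166-2 records into DataFrames before loading them into Trino (the column selection is illustrative):

import pandas as pd
import pycountry

# ISO 3166-1 country codes
countries = pd.DataFrame(
    [{"alpha_2": c.alpha_2, "alpha_3": c.alpha_3,
      "numeric": c.numeric, "name": c.name}
     for c in pycountry.countries]
)

# ISO 3166-2 subdivisions (states, provinces, ...)
subdivisions = pd.DataFrame(
    [{"code": s.code, "name": s.name, "type": s.type,
      "country_code": s.country_code}
     for s in pycountry.subdivisions]
)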
