Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add aggregate_by_weighted_sum to geomapper #1960

Merged
merged 9 commits into from
May 7, 2024
31 changes: 11 additions & 20 deletions _delphi_utils_python/data_proc/geomap/README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Geocoding data processing pipeline
# Geocoding Data Processing

Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov

Expand All @@ -7,42 +7,37 @@ Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov
Requires the following source files below.

Run the following to build the crosswalk tables in `covidcast-indicators/_delph_utils_python/delph_utils/data`
```

```sh
$ python geo_data_proc.py
```

You can see consistency checks and diffs with old sources in ./consistency_checks.ipynb
Find data consistency checks in `./source-file-sanity-check.ipynb`.

## Geo Codes

We support the following geocodes.

- The ZIP code and the FIPS code are the most granular geocodes we support.
- The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros).
- The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html).
- We are reserving 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
- The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros).
- The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html). We rserve 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
- State codes are a series of equivalent identifiers for US state. They include the state name, the state number (state_id), and the state two-letter abbreviation (state_code). The state number is the state FIPS code. See [here](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations) for more.
- The Hospital Referral Region (HRR) and the Hospital Service Area (HSA). More information [here](https://www.dartmouthatlas.org/covid-19/hrr-mapping/).
FIPS codes depart in some special cases, so we produce manual changes listed below.

## Source files
## Source Files

The source files are requested from a government URL when `geo_data_proc.py` is run (see the top of said script for the URLs). Below we describe the locations to find updated versions of the source files, if they are ever needed.

- ZIP -> FIPS (county) population tables available from [US Census](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622). This file contains the population of the intersections between ZIP and FIPS regions, allowing the creation of a population-weighted transform between the two. As of 4 February 2022, this source did not include population information for 24 ZIPs that appear in our indicators. We have added those values manually using information available from the [zipdatamaps website](www.zipdatamaps.com).
- ZIP -> HRR -> HSA crosswalk file comes from the 2018 version at the [Dartmouth Atlas Project](https://atlasdata.dartmouth.edu/static/supp_research_data).
- FIPS -> MSA crosswalk file comes from the September 2018 version of the delineation files at the [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). The first two digits of a FIPS codes should match the state code here.
- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3).


## Derived files
## Derived Files

The rest of the crosswalk tables are derived from the mappings above. We provide crosswalk functions from granular to coarser codes, but not the other way around. This is because there is no information gained when crosswalking from coarse to granular.



## Deprecated source files
## Deprecated Source Files

- ZIP to FIPS to HRR to states: `02_20_uszips.csv` comes from a version of the table [here](https://simplemaps.com/data/us-zips) modified by Jingjing to include population weights.
- The `02_20_uszips.csv` file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from [here](https://simplemaps.com/data/us-zips). This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with [zip-codes](https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from [here](https://atlasdata.dartmouth.edu/static/supp_research_data).
Expand All @@ -51,7 +46,3 @@ The rest of the crosswalk tables are derived from the mappings above. We provide
- CBSA -> FIPS crosswalk from [here](https://data.nber.org/data/cbsa-fips-county-crosswalk.html) (the file is `cbsatocountycrosswalk.csv`).
- MSA tables from March 2020 [here](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).
- MSA tables from 2019 [here](https://apps.bea.gov/regional/docs/msalist.cfm)

## Notes

- The NAs in the coding currently zero-fills.
8 changes: 2 additions & 6 deletions _delphi_utils_python/data_proc/geomap/geo_data_proc.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,7 @@
"""
Authors: Dmitry Shemetov @dshemetov, James Sharpnack @jsharpna

Intended execution:
Authors: Dmitry Shemetov, James Sharpnack

cd _delphi_utils/data_proc/geomap
chmod u+x geo_data_proc.py
python geo_data_proc.py
"""

Expand All @@ -19,7 +16,7 @@


# Source files
YEAR = 2019
YEAR = 2020
INPUT_DIR = "./old_source_files"
OUTPUT_DIR = f"../../delphi_utils/data/{YEAR}"
FIPS_BY_ZIP_POP_URL = "https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt?#"
Expand All @@ -41,7 +38,6 @@
FIPS_HHS_FILENAME = "fips_hhs_table.csv"
FIPS_CHNGFIPS_OUT_FILENAME = "fips_chng-fips_table.csv"
FIPS_POPULATION_OUT_FILENAME = "fips_pop.csv"

CHNGFIPS_STATE_OUT_FILENAME = "chng-fips_state_table.csv"
ZIP_HSA_OUT_FILENAME = "zip_hsa_table.csv"
ZIP_HRR_OUT_FILENAME = "zip_hrr_table.csv"
Expand Down
179 changes: 121 additions & 58 deletions _delphi_utils_python/delphi_utils/geomap.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,54 +18,90 @@ class GeoMapper: # pylint: disable=too-many-public-methods

The GeoMapper class provides utility functions for translating between different
geocodes. Supported geocodes:
- zip: zip5, a length 5 str of 0-9 with leading 0's
- fips: state code and county code, a length 5 str of 0-9 with leading 0's
- msa: metropolitan statistical area, a length 5 str of 0-9 with leading 0's
- state_code: state code, a str of 0-9
- state_id: state id, a str of A-Z
- hrr: hospital referral region, an int 1-500

Mappings:
- [x] zip -> fips : population weighted
- [x] zip -> hrr : unweighted
- [x] zip -> msa : unweighted
- [x] zip -> state
- [x] zip -> hhs
- [x] zip -> population
- [x] state code -> hhs
- [x] fips -> state : unweighted
- [x] fips -> msa : unweighted
- [x] fips -> megacounty
- [x] fips -> hrr
- [x] fips -> hhs
- [x] fips -> chng-fips
- [x] chng-fips -> state : unweighted
- [x] nation
- [ ] zip -> dma (postponed)

The GeoMapper instance loads crosswalk tables from the package data_dir. The
crosswalk tables are assumed to have been built using the geo_data_proc.py script
in data_proc/geomap. If a mapping between codes is NOT one to many, then the table has
just two colums. If the mapping IS one to many, then a third column, the weight column,
exists (e.g. zip, fips, weight; satisfying (sum(weights) where zip==ZIP) == 1).

- zip: five characters [0-9] with leading 0's, e.g. "33626"
also known as zip5 or zip code
- fips: five characters [0-9] with leading 0's, e.g. "12057"
the first two digits are the state FIPS code and the last
three are the county FIPS code
- msa: five characters [0-9] with leading 0's, e.g. "90001"
also known as metropolitan statistical area
- state_code: two characters [0-9], e.g "06"
- state_id: two characters [A-Z], e.g "CA"
- state_name: human-readable name, e.g "California"
dshemetov marked this conversation as resolved.
Show resolved Hide resolved
- state_*: we use this below to refer to the three above geocodes in aggregate
- hrr: an integer from 1-500, also known as hospital
referral region
- hhs: an integer from 1-10, also known as health and human services region
https://www.hhs.gov/about/agencies/iea/regional-offices/index.html

Valid mappings:

From To Population Weighted
zip fips Yes
zip hrr No
zip msa Yes
zip state_* Yes
dshemetov marked this conversation as resolved.
Show resolved Hide resolved
zip hhs Yes
zip population --
zip nation No
state_* state_* No
state_* hhs No
state_* population --
state_* nation No
fips state_* No
fips msa No
fips megacounty No
fips hrr Yes
fips hhs No
fips chng-fips No
fips nation No
chng-fips state_* No

Crosswalk Tables
================

The GeoMapper instance loads pre-generated crosswalk tables (built by the
script in `data_proc/geomap/geo_data_proc.py`). If a mapping between codes
is one to one or many to one, then the table has just two columns. If the
mapping is one to many, then a weight column is provided, which gives the
fractional population contribution of a source_geo to the target_geo. The
weights satisfy the condition that df.groupby(from_code).sum(weight) == 1.0
for all values of from_code.

Aggregation
===========

The GeoMapper class provides functions to aggregate data from one geocode
to another. The aggregation can be a simple one-to-one mapping or a
weighted aggregation. The weighted aggregation is useful when the data
being aggregated is a population-weighted quantity, such as visits or
cases. The aggregation is done by multiplying the data columns by the
weights and summing over the data columns. Note that the aggregation does
not adjust the aggregation for missing or NA values in the data columns,
which is equivalent to a zero-fill.

Example Usage
==========
=============
The main GeoMapper object loads and stores crosswalk dataframes on-demand.

When replacing geocodes with a new one an aggregation step is performed on the data columns
to merge entries (i.e. in the case of a many to one mapping or a weighted mapping). This
requires a specification of the data columns, which are assumed to be all the columns that
are not the geocodes or the date column specified in date_col.
When replacing geocodes with a new one an aggregation step is performed on
the data columns to merge entries (i.e. in the case of a many to one
mapping or a weighted mapping). This requires a specification of the data
columns, which are assumed to be all the columns that are not the geocodes
or the date column specified in date_col.

Example 1: to add a new column with a new geocode, possibly with weights:
> gmpr = GeoMapper()
> df = gmpr.add_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id",
> df = gmpr.add_geocode(df, "fips", "zip",
from_col="fips", new_col="geo_id",
date_col="timestamp", dropna=False)

Example 2: to replace a geocode column with a new one, aggregating the data with weights:
Example 2: to replace a geocode column with a new one, aggregating the data
with weights:
> gmpr = GeoMapper()
> df = gmpr.replace_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id",
> df = gmpr.replace_geocode(df, "fips", "zip",
from_col="fips", new_col="geo_id",
date_col="timestamp", dropna=False)
"""

Expand Down Expand Up @@ -113,7 +149,7 @@ def __init__(self, census_year: int = 2020):
subkey
for mainkey in self.CROSSWALK_FILENAMES
for subkey in self.CROSSWALK_FILENAMES[mainkey]
}.union(set(self.CROSSWALK_FILENAMES.keys())) - set(["state", "pop"])
}.union(set(self.CROSSWALK_FILENAMES.keys())) - {"state", "pop"}

for from_code, to_codes in self.CROSSWALK_FILENAMES.items():
for to_code, file_path in to_codes.items():
Expand All @@ -135,7 +171,6 @@ def _load_crosswalk_from_file(
"weight": float,
**{geo: str for geo in self._geos - set("nation")},
}

usecols = [from_code, "pop"] if to_code == "pop" else None
return pd.read_csv(stream, dtype=dtype, usecols=usecols)

Expand Down Expand Up @@ -229,13 +264,6 @@ def add_geocode(
):
"""Add a new geocode column to a dataframe.
dshemetov marked this conversation as resolved.
Show resolved Hide resolved

Currently supported conversions:
- fips -> state_code, state_id, state_name, zip, msa, hrr, nation, hhs, chng-fips
- chng-fips -> state_code, state_id, state_name
- zip -> state_code, state_id, state_name, fips, msa, hrr, nation, hhs
- state_x -> state_y (where x and y are in {code, id, name}), nation
- state_code -> hhs, nation

Parameters
---------
df: pd.DataFrame
Expand Down Expand Up @@ -303,7 +331,7 @@ def add_geocode(
df = df.merge(crosswalk, left_on=from_col, right_on=from_col, how="left")

# Drop extra state columns
if new_code in state_codes and not from_code in state_codes:
if new_code in state_codes and from_code not in state_codes:
state_codes.remove(new_code)
df.drop(columns=state_codes, inplace=True)
elif new_code in state_codes and from_code in state_codes:
Expand Down Expand Up @@ -345,13 +373,6 @@ def replace_geocode(
) -> pd.DataFrame:
"""Replace a geocode column in a dataframe.
dshemetov marked this conversation as resolved.
Show resolved Hide resolved

Currently supported conversions:
- fips -> chng-fips, state_code, state_id, state_name, zip, msa, hrr, nation
- chng-fips -> state_code, state_id, state_name
- zip -> state_code, state_id, state_name, fips, msa, hrr, nation
- state_x -> state_y (where x and y are in {code, id, name}), nation
- state_code -> hhs, nation

Parameters
---------
df: pd.DataFrame
Expand Down Expand Up @@ -397,7 +418,7 @@ def replace_geocode(
df[data_cols] = df[data_cols].multiply(df["weight"], axis=0)
df.drop("weight", axis=1, inplace=True)

if not date_col is None:
if date_col is not None:
df = df.groupby([date_col, new_col]).sum(numeric_only=True).reset_index()
else:
df = df.groupby([new_col]).sum(numeric_only=True).reset_index()
Expand Down Expand Up @@ -575,8 +596,7 @@ def get_geos_within(
Return all contained regions of the given type within the given container geocode.

Given container_geocode (e.g "ca" for California) of type container_geocode_type
(e.g "state"), return:
- all (contained_geocode_type)s within container_geocode
(e.g "state"), return all (contained_geocode_type)s within container_geocode.

Supports these 4 combinations:
- all states within a nation
Expand Down Expand Up @@ -627,3 +647,46 @@ def get_geos_within(
"must be one of (state, nation), (state, hhs), (county, state)"
", (fips, state), (chng-fips, state)"
)

def aggregate_by_weighted_sum(
self, df: pd.DataFrame, to_geo: str, sensor: str, population_column: str
) -> pd.DataFrame:
"""Aggregate sensor, weighted by time-dependent population.

Note: This function generates its own population weights and adjusts the
weights based on which data is NA. This is in contrast to the
`replace_geocode` function, which assumes that the weights are already
present in the data and does not adjust for missing data (see the
docstring for the GeoMapper class).
dshemetov marked this conversation as resolved.
Show resolved Hide resolved

Parameters
---------
df: pd.DataFrame
Input dataframe, assumed to have a sensor column (e.g. "visits"), a
to_geo column (e.g. "state"), and a population column (corresponding
to a from_geo, e.g. "wastewater collection site").
to_geo: str
The column name of the geocode to aggregate to.
sensor: str
The column name of the sensor to aggregate.
population_column: str
The column name of the population to weight the sensor by.

Returns
---------
agg_df: pd.DataFrame
A dataframe with the aggregated sensor values, weighted by population.
"""
# Zero-out populations where the sensor is NA
df[f"relevant_pop_{sensor}"] = df[population_column] * df[sensor].abs().notna()
# Weight the sensor by the population
df[f"weighted_{sensor}"] = df[sensor] * df[f"relevant_pop_{sensor}"]
agg_df = df.groupby(["timestamp", to_geo]).agg(
{
f"relevant_pop_{sensor}": "sum",
f"weighted_{sensor}": lambda x: x.sum(min_count=1),
}
)
agg_df["val"] = agg_df[f"weighted_{sensor}"] / agg_df[f"relevant_pop_{sensor}"]
agg_df = agg_df.reset_index()
return agg_df
Loading