cmu-delphi · dshemetov · May 7, 2024 · May 6, 2024 · May 6, 2024 · May 7, 2024
diff --git a/_delphi_utils_python/data_proc/geomap/README.md b/_delphi_utils_python/data_proc/geomap/README.md
@@ -1,4 +1,4 @@
-# Geocoding data processing pipeline
+# Geocoding Data Processing
 
 Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov
 
@@ -7,42 +7,37 @@ Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov
 Requires the following source files below.
 
 Run the following to build the crosswalk tables in `covidcast-indicators/_delph_utils_python/delph_utils/data`
-```
+
+```sh
 $ python geo_data_proc.py
 ```
 
-You can see consistency checks and diffs with old sources in ./consistency_checks.ipynb
+Find data consistency checks in `./source-file-sanity-check.ipynb`.
 
 ## Geo Codes
 
 We support the following geocodes.
 
-- The ZIP code and the FIPS code are the most granular geocodes we support.
-  - The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros). 
-  - The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
-- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html).
-  - We are reserving 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
+- The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros).
+- The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
+- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html). We rserve 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
 - State codes are a series of equivalent identifiers for US state. They include the state name, the state number (state_id), and the state two-letter abbreviation (state_code). The state number is the state FIPS code. See [here](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations) for more.
 - The Hospital Referral Region (HRR) and the Hospital Service Area (HSA). More information [here](https://www.dartmouthatlas.org/covid-19/hrr-mapping/).
-FIPS codes depart in some special cases, so we produce manual changes listed below.
 
-## Source files
+## Source Files
 
 The source files are requested from a government URL when `geo_data_proc.py` is run (see the top of said script for the URLs). Below we describe the locations to find updated versions of the source files, if they are ever needed.
 
 - ZIP -> FIPS (county) population tables available from [US Census](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622). This file contains the population of the intersections between ZIP and FIPS regions, allowing the creation of a population-weighted transform between the two. As of 4 February 2022, this source did not include population information for 24 ZIPs that appear in our indicators. We have added those values manually using information available from the [zipdatamaps website](www.zipdatamaps.com).
 - ZIP -> HRR -> HSA crosswalk file comes from the 2018 version at the [Dartmouth Atlas Project](https://atlasdata.dartmouth.edu/static/supp_research_data).
 - FIPS -> MSA crosswalk file comes from the September 2018 version of the delineation files at the [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
-- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). The first two digits of a FIPS codes should match the state code here.
+- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3).
 
-
-## Derived files
+## Derived Files
 
 The rest of the crosswalk tables are derived from the mappings above. We provide crosswalk functions from granular to coarser codes, but not the other way around. This is because there is no information gained when crosswalking from coarse to granular.
 
-
-
-## Deprecated source files
+## Deprecated Source Files
 
 - ZIP to FIPS to HRR to states: `02_20_uszips.csv` comes from a version of the table [here](https://simplemaps.com/data/us-zips) modified by Jingjing to include population weights.
   - The `02_20_uszips.csv` file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from [here](https://simplemaps.com/data/us-zips). This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with [zip-codes](https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from [here](https://atlasdata.dartmouth.edu/static/supp_research_data).
@@ -51,7 +46,3 @@ The rest of the crosswalk tables are derived from the mappings above. We provide
 - CBSA -> FIPS crosswalk from [here](https://data.nber.org/data/cbsa-fips-county-crosswalk.html) (the file is `cbsatocountycrosswalk.csv`).
 - MSA tables from March 2020 [here](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).
 - MSA tables from 2019 [here](https://apps.bea.gov/regional/docs/msalist.cfm)
-
-## Notes
-
-- The NAs in the coding currently zero-fills.
diff --git a/_delphi_utils_python/data_proc/geomap/geo_data_proc.py b/_delphi_utils_python/data_proc/geomap/geo_data_proc.py
@@ -1,10 +1,7 @@
 """
-Authors: Dmitry Shemetov @dshemetov, James Sharpnack @jsharpna
-
-Intended execution:
+Authors: Dmitry Shemetov, James Sharpnack
 
 cd _delphi_utils/data_proc/geomap
-chmod u+x geo_data_proc.py
 python geo_data_proc.py
 """
 
@@ -19,7 +16,7 @@
 
 
 # Source files
-YEAR = 2019
+YEAR = 2020
 INPUT_DIR = "./old_source_files"
 OUTPUT_DIR = f"../../delphi_utils/data/{YEAR}"
 FIPS_BY_ZIP_POP_URL = "https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt?#"
@@ -41,7 +38,6 @@
 FIPS_HHS_FILENAME = "fips_hhs_table.csv"
 FIPS_CHNGFIPS_OUT_FILENAME = "fips_chng-fips_table.csv"
 FIPS_POPULATION_OUT_FILENAME = "fips_pop.csv"
-
 CHNGFIPS_STATE_OUT_FILENAME = "chng-fips_state_table.csv"
 ZIP_HSA_OUT_FILENAME = "zip_hsa_table.csv"
 ZIP_HRR_OUT_FILENAME = "zip_hrr_table.csv"

diff --git a/_delphi_utils_python/delphi_utils/geomap.py b/_delphi_utils_python/delphi_utils/geomap.py
@@ -18,54 +18,90 @@ class GeoMapper:  # pylint: disable=too-many-public-methods
 
     The GeoMapper class provides utility functions for translating between different
     geocodes. Supported geocodes:
-    - zip: zip5, a length 5 str of 0-9 with leading 0's
-    - fips: state code and county code, a length 5 str of 0-9 with leading 0's
-    - msa: metropolitan statistical area, a length 5 str of 0-9 with leading 0's
-    - state_code: state code, a str of 0-9
-    - state_id: state id, a str of A-Z
-    - hrr: hospital referral region, an int 1-500
-
-    Mappings:
-    - [x] zip -> fips : population weighted
-    - [x] zip -> hrr : unweighted
-    - [x] zip -> msa : unweighted
-    - [x] zip -> state
-    - [x] zip -> hhs
-    - [x] zip -> population
-    - [x] state code -> hhs
-    - [x] fips -> state : unweighted
-    - [x] fips -> msa : unweighted
-    - [x] fips -> megacounty
-    - [x] fips -> hrr
-    - [x] fips -> hhs
-    - [x] fips -> chng-fips
-    - [x] chng-fips -> state : unweighted
-    - [x] nation
-    - [ ] zip -> dma (postponed)
-
-    The GeoMapper instance loads crosswalk tables from the package data_dir. The
-    crosswalk tables are assumed to have been built using the geo_data_proc.py script
-    in data_proc/geomap. If a mapping between codes is NOT one to many, then the table has
-    just two colums. If the mapping IS one to many, then a third column, the weight column,
-    exists (e.g. zip, fips, weight; satisfying (sum(weights) where zip==ZIP) == 1).
+
+    - zip:          five characters [0-9] with leading 0's, e.g. "33626"
+                    also known as zip5 or zip code
+    - fips:         five characters [0-9] with leading 0's, e.g. "12057"
+                    the first two digits are the state FIPS code and the last
+                    three are the county FIPS code
+    - msa:          five characters [0-9] with leading 0's, e.g. "90001"
+                    also known as metropolitan statistical area
+    - state_code:   two characters [0-9], e.g "06"
+    - state_id:     two characters [A-Z], e.g "CA"
+    - state_name:   human-readable name, e.g "California"
+    - state_*:      we use this below to refer to the three above geocodes in aggregate
+    - hrr:          an integer from 1-500, also known as hospital
+                    referral region
+    - hhs:          an integer from 1-10, also known as health and human services region
+                    https://www.hhs.gov/about/agencies/iea/regional-offices/index.html
+
+    Valid mappings:
+
+    From            To              Population Weighted
+    zip             fips            Yes
+    zip             hrr             No
+    zip             msa             Yes
+    zip             state_*         Yes
+    zip             hhs             Yes
+    zip             population      --
+    zip             nation          No
+    state_*         state_*         No
+    state_*         hhs             No
+    state_*         population      --
+    state_*         nation          No
+    fips            state_*         No
+    fips            msa             No
+    fips            megacounty      No
+    fips            hrr             Yes
+    fips            hhs             No
+    fips            chng-fips       No
+    fips            nation          No
+    chng-fips       state_*         No
+
+    Crosswalk Tables
+    ================
+
+    The GeoMapper instance loads pre-generated crosswalk tables (built by the
+    script in `data_proc/geomap/geo_data_proc.py`). If a mapping between codes
+    is one to one or many to one, then the table has just two columns. If the
+    mapping is one to many, then a weight column is provided, which gives the
+    fractional population contribution of a source_geo to the target_geo. The
+    weights satisfy the condition that df.groupby(from_code).sum(weight) == 1.0
+    for all values of from_code.
+
+    Aggregation
+    ===========
+
+    The GeoMapper class provides functions to aggregate data from one geocode
+    to another. The aggregation can be a simple one-to-one mapping or a
+    weighted aggregation. The weighted aggregation is useful when the data
+    being aggregated is a population-weighted quantity, such as visits or
+    cases. The aggregation is done by multiplying the data columns by the
+    weights and summing over the data columns. Note that the aggregation does
+    not adjust the aggregation for missing or NA values in the data columns,
+    which is equivalent to a zero-fill.
 
     Example Usage
-    ==========
+    =============
     The main GeoMapper object loads and stores crosswalk dataframes on-demand.
 
-    When replacing geocodes with a new one an aggregation step is performed on the data columns
-    to merge entries  (i.e. in the case of a many to one mapping or a weighted mapping). This
-    requires a specification of the data columns, which are assumed to be all the columns that
-    are not the geocodes or the date column specified in date_col.
+    When replacing geocodes with a new one an aggregation step is performed on
+    the data columns to merge entries  (i.e. in the case of a many to one
+    mapping or a weighted mapping). This requires a specification of the data
+    columns, which are assumed to be all the columns that are not the geocodes
+    or the date column specified in date_col.
 
     Example 1: to add a new column with a new geocode, possibly with weights:
     > gmpr = GeoMapper()
-    > df = gmpr.add_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id",
+    > df = gmpr.add_geocode(df, "fips", "zip",
+                            from_col="fips", new_col="geo_id",
                             date_col="timestamp", dropna=False)
 
-    Example 2: to replace a geocode column with a new one, aggregating the data with weights:
+    Example 2: to replace a geocode column with a new one, aggregating the data
+    with weights:
     > gmpr = GeoMapper()
-    > df = gmpr.replace_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id",
+    > df = gmpr.replace_geocode(df, "fips", "zip",
+                                from_col="fips", new_col="geo_id",
                                 date_col="timestamp", dropna=False)
     """
 
@@ -113,7 +149,7 @@ def __init__(self, census_year: int = 2020):
             subkey
             for mainkey in self.CROSSWALK_FILENAMES
             for subkey in self.CROSSWALK_FILENAMES[mainkey]
-        }.union(set(self.CROSSWALK_FILENAMES.keys())) - set(["state", "pop"])
+        }.union(set(self.CROSSWALK_FILENAMES.keys())) - {"state", "pop"}
 
         for from_code, to_codes in self.CROSSWALK_FILENAMES.items():
             for to_code, file_path in to_codes.items():
@@ -135,7 +171,6 @@ def _load_crosswalk_from_file(
             "weight": float,
             **{geo: str for geo in self._geos - set("nation")},
         }
-
         usecols = [from_code, "pop"] if to_code == "pop" else None
         return pd.read_csv(stream, dtype=dtype, usecols=usecols)
 
@@ -229,13 +264,6 @@ def add_geocode(
     ):
         """Add a new geocode column to a dataframe.
 
-        Currently supported conversions:
-        - fips -> state_code, state_id, state_name, zip, msa, hrr, nation, hhs, chng-fips
-        - chng-fips -> state_code, state_id, state_name
-        - zip -> state_code, state_id, state_name, fips, msa, hrr, nation, hhs
-        - state_x -> state_y (where x and y are in {code, id, name}), nation
-        - state_code -> hhs, nation
-
         Parameters
         ---------
         df: pd.DataFrame
@@ -303,7 +331,7 @@ def add_geocode(
             df = df.merge(crosswalk, left_on=from_col, right_on=from_col, how="left")
 
         # Drop extra state columns
-        if new_code in state_codes and not from_code in state_codes:
+        if new_code in state_codes and from_code not in state_codes:
             state_codes.remove(new_code)
             df.drop(columns=state_codes, inplace=True)
         elif new_code in state_codes and from_code in state_codes:
@@ -345,13 +373,6 @@ def replace_geocode(
     ) -> pd.DataFrame:
         """Replace a geocode column in a dataframe.
 
-        Currently supported conversions:
-        - fips -> chng-fips, state_code, state_id, state_name, zip, msa, hrr, nation
-        - chng-fips -> state_code, state_id, state_name
-        - zip -> state_code, state_id, state_name, fips, msa, hrr, nation
-        - state_x -> state_y (where x and y are in {code, id, name}), nation
-        - state_code -> hhs, nation
-
         Parameters
         ---------
         df: pd.DataFrame
@@ -397,7 +418,7 @@ def replace_geocode(
             df[data_cols] = df[data_cols].multiply(df["weight"], axis=0)
             df.drop("weight", axis=1, inplace=True)
 
-        if not date_col is None:
+        if date_col is not None:
             df = df.groupby([date_col, new_col]).sum(numeric_only=True).reset_index()
         else:
             df = df.groupby([new_col]).sum(numeric_only=True).reset_index()
@@ -575,8 +596,7 @@ def get_geos_within(
         Return all contained regions of the given type within the given container geocode.
 
         Given container_geocode (e.g "ca" for California) of type container_geocode_type
-        (e.g "state"), return:
-            - all (contained_geocode_type)s within container_geocode
+        (e.g "state"), return all (contained_geocode_type)s within container_geocode.
 
         Supports these 4 combinations:
             - all states within a nation
@@ -627,3 +647,46 @@ def get_geos_within(
             "must be one of (state, nation), (state, hhs), (county, state)"
             ", (fips, state), (chng-fips, state)"
         )
+
+    def aggregate_by_weighted_sum(
+        self, df: pd.DataFrame, to_geo: str, sensor: str, population_column: str
+    ) -> pd.DataFrame:
+        """Aggregate sensor, weighted by time-dependent population.
+
+        Note: This function generates its own population weights and adjusts the
+        weights based on which data is NA. This is in contrast to the
+        `replace_geocode` function, which assumes that the weights are already
+        present in the data and does not adjust for missing data (see the
+        docstring for the GeoMapper class).
+
+        Parameters
+        ---------
+        df: pd.DataFrame
+            Input dataframe, assumed to have a sensor column (e.g. "visits"), a
+            to_geo column (e.g. "state"), and a population column (corresponding
+            to a from_geo, e.g. "wastewater collection site").
+        to_geo: str
+            The column name of the geocode to aggregate to.
+        sensor: str
+            The column name of the sensor to aggregate.
+        population_column: str
+            The column name of the population to weight the sensor by.
+
+        Returns
+        ---------
+        agg_df: pd.DataFrame
+            A dataframe with the aggregated sensor values, weighted by population.
+        """
+        # Zero-out populations where the sensor is NA
+        df[f"relevant_pop_{sensor}"] = df[population_column] * df[sensor].abs().notna()
+        # Weight the sensor by the population
+        df[f"weighted_{sensor}"] = df[sensor] * df[f"relevant_pop_{sensor}"]
+        agg_df = df.groupby(["timestamp", to_geo]).agg(
+            {
+                f"relevant_pop_{sensor}": "sum",
+                f"weighted_{sensor}": lambda x: x.sum(min_count=1),
+            }
+        )
+        agg_df["val"] = agg_df[f"weighted_{sensor}"] / agg_df[f"relevant_pop_{sensor}"]
+        agg_df = agg_df.reset_index()
+        return agg_df