Google Research's Open COVID-19 Data project is an open source pipeline that aggregates public COVID-19 data sources into a single dataset. The data includes time series data for COVID-19 cases, deaths, tests, hospitalizations, discharges, intensive care unit (ICU) cases, ventilator cases, government interventions, and Google's Community Mobility Reports and Search Trends symptoms dataset.
COVID-19 data is published from many distinct sources with highly heterogeneous formats. The goal of this pipeline is to accept data in many different formats and to process it into a standardized and consistent schema. A consistent schema allows researchers to build models quickly, and the pipeline is designed so that engineers can add new data sources quickly.
The pipeline supports three ways of ingesting data:
- Automatic downloads: data that can be downloaded as a .csv or .xlsx file from a consistent URL
- Manual downloads: data that can be downloaded as a .csv or .xlsx file, but must be downloaded manually because the URL changes
- Scraped data: data that is not machine-readable and must be scraped by a human (e.g. from charts, tables, PDFs, or occasionally tweets)
For each data source, this repository has a configuration file located in src/config/sources that specifies how the pipeline should map the original data into our schema. Raw data is fetched from the data source and written into a directory within data/inputs. Exported data that has been transformed into our schema is found in the data/exports directory.
If you just want to use the latest data for models, visualizations, or research, we provide aggregated data files under different licenses. This is to provide you with options so that you can use data with a license that is acceptable for your use case, while respecting the original licenses of the data sources.
- Aggregated data under a CC-BY license can be downloaded from this link.
- Aggregated data under a CC-BY-SA license can be downloaded from this link.
- Aggregated data under a CC-BY-NC license can be downloaded from this link.
- There are two data sources released under Google Terms of Service. To download or use the data, you must agree to the Google Terms of Service.
  - Google's Community Mobility Reports can be downloaded from this directory.
  - Google's Search Trends Symptoms Dataset can be downloaded from this directory.
Please see the Data Sources section of this README to note the attributions and licenses for each source.
Every location is assigned an open_covid_region_code, which is a unique hierarchical location code that can be used to join data across tables in this repository. The full list of locations that are assigned an open_covid_region_code can be found at data/exports/locations/locations.csv. Where available, we also provide a datacommons_id and wikidata_id field for each location.
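As a minimal sketch of joining on the region code, the example below merges a small time series table with a locations table using pandas. Both tables are hypothetical stand-ins built inline: the region_name and hospitalized_current columns and all values are assumptions for illustration, not the actual contents of the exported files.

```python
import pandas as pd

# Hypothetical stand-in for data/exports/locations/locations.csv.
# Only open_covid_region_code is a column named in this README;
# region_name is an assumed column used for illustration.
locations = pd.DataFrame({
    "open_covid_region_code": ["IT", "US-AL"],
    "region_name": ["Italy", "Alabama"],
})

# Hypothetical stand-in for an exported time series table (made-up values).
hospitalizations = pd.DataFrame({
    "open_covid_region_code": ["IT", "IT", "US-AL"],
    "date": ["2020-08-14", "2020-08-15", "2020-08-15"],
    "hospitalized_current": [10, 12, 7],
})

# Many time series rows map to one location row, so a left join keeps every
# observation; validate= raises if a region code is duplicated in locations.
joined = hospitalizations.merge(
    locations,
    on="open_covid_region_code",
    how="left",
    validate="many_to_one",
)

print(joined)
```

The validate argument is optional, but it turns an accidental duplicate region code in the locations table into an immediate error rather than silently duplicated rows.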
Each open_covid_region_code has up to three levels:
- The first-level region codes are ISO-3166-1 codes, e.g. IT for Italy.
- The second-level region codes are, by default, ISO-3166-2 codes. For example, US-AL for Alabama. However, in some locations, COVID-19 data is reported in administrative regions other than ISO-3166-2, so the choice of sub-country regions is informed partially by data availability.
- Third-level regions include cities and counties. Within the United States, counties are coded using FIPS 6-4 codes.
All dates are mapped to ISO 8601 format during data loading, e.g. 2020-08-15.
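To illustrate the kind of conversion involved, here is a small sketch using only the Python standard library. The helper name to_iso_date and the input formats are assumptions for this example; it is not the parser implemented in the repository.

```python
from datetime import datetime


def to_iso_date(raw: str, date_format: str) -> str:
    """Parse a date string with the given strptime format and return YYYY-MM-DD."""
    return datetime.strptime(raw, date_format).date().isoformat()


# Two hypothetical source formats that both normalize to the same ISO 8601 date.
print(to_iso_date("15/08/2020", "%d/%m/%Y"))
print(to_iso_date("08-15-20", "%m-%d-%y"))
```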
We have carefully checked the license and attribution information on each data source included in this repository, and in many cases have contacted the data owners directly to ask how they would like to be attributed.
If you are the owner of a data source included here and would like us to remove data, add or alter an attribution, or add or alter license information, please do not hesitate to email us at [email protected] and we will happily consider your request.
If you would like to run the pipeline locally or to contribute to the codebase, here are instructions for installation and adding new data sources.
To install Python dependencies:
pip install pandas xlrd pyyaml python3-wget
To run the main script that runs the entire pipeline on the data that is in data/inputs:
python src/scripts/export_data.py
In addition, there are two scripts that can be run to fetch new data and write it into data/inputs.
To fetch data that can be automatically downloaded:
python src/scripts/fetch_automatic_downloads.py
To fetch data from a spreadsheet in data/inputs/scraped/spreadsheets/:
python src/scripts/fetch_scraped_data.py
The pipeline is structured so that raw data is always fetched into data/inputs before being consumed by the rest of the pipeline. Data sources for each data type are then loaded into pandas dataframes with a standardized schema for dates, locations, and columns. These dataframes are joined into a single dataframe, which is then exported.
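The join step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the join keys (date and open_covid_region_code), the outer join, the data column names, and the values are all hypothetical.

```python
from functools import reduce

import pandas as pd

# Assumed join keys shared by every standardized dataframe.
KEYS = ["date", "open_covid_region_code"]

# Hypothetical per-data-type dataframes, already in a standardized schema.
cases = pd.DataFrame({
    "date": ["2020-08-15", "2020-08-15"],
    "open_covid_region_code": ["IT", "US-AL"],
    "cases_cumulative": [100, 50],
})
hospitalizations = pd.DataFrame({
    "date": ["2020-08-15"],
    "open_covid_region_code": ["IT"],
    "hospitalized_current": [12],
})


def join_data_types(frames):
    """Outer-join dataframes on the shared keys so no date/region pair is lost."""
    return reduce(lambda left, right: left.merge(right, on=KEYS, how="outer"), frames)


combined = (
    join_data_types([cases, hospitalizations])
    .sort_values(KEYS)
    .reset_index(drop=True)
)
print(combined)
```

An outer join is used here so that a region reporting only some data types still appears in the output, with missing values where a source has no data.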
Before a new data source is added, it goes through an internal approval process within Google to ensure compliance with licensing and terms. Once a data source is approved, you can add the data to the pipeline as follows:
- If the source includes a data type that isn't yet included in the data schema, register the data type in the schema by adding an entry to src/config/data.yaml.
- Specify the fetch parameters:
  - source_url: where to download the data
  - method: one of AUTOMATIC_DOWNLOAD, MANUAL_DOWNLOAD, SCRAPED, STATIC
  - file: filename for the data source
- Specify the load parameters:
  - function: which function in load_functions.py to use to load the data. Most data sources can be loaded with default_load_function, but some data sources will have formatting that requires implementing a new function in load_functions.py.
  - read: data sources are read using the pandas.read_csv() or pandas.read_excel() functions. The read field accepts key/value parameters that are passed to the appropriate pandas read function.
  - dates:
    - columns: list of column names in the original data source that are passed as arguments to a function that returns the date in ISO-8601 format. This is often just a single column, but sometimes the year, month, and day are in separate columns in the original data.
    - date_format: the format of the date in the original data source
    - parse_function: most dates can be parsed using the default function in date_utils.py. If the data source has a date format that requires a parser that doesn't exist in date_utils.py, implement a separate function in that file.
  - regions:
    - mapping_keys: if a data source contains multiple regions but not ISO-3166 codes for the regions, the locations file at data/exports/locations/locations.csv must contain a column or list of columns that can uniquely map the locations in the data to the region_code for that location. The mapping_keys field takes key/value fields where the key is the column in the locations file, and the value is the string name of the column in the original data source.
- Specify the data parameters:
  - These parameters follow the data schema specified in src/config/data.yaml, where the keys come from the data schema and the values are the column names in the original data source for the corresponding data.
- Specify the attribution parameters. These are used to generate the data source section of the README. The fields for existing data sources serve as an example of what to include.
- Specify the license parameters. These are used to generate the LICENSE file. The fields for existing data sources serve as an example of what to include.
- Specify the cc_by and cc_by_sa fields: we produce two aggregated CSV files, one licensed under CC-BY and the other under CC-BY-SA. These fields specify whether the data can appear in each file.
- When you run src/scripts/export_data.py, it will update the README.md as well as the LICENSE files within data/exports.
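To make the configuration fields more concrete, the sketch below shows how a config with read, dates, regions, and data entries could drive a load step. Everything in it is an assumption for illustration: the config is written as the Python dict a YAML parser would return, the nesting of dates and regions under load is inferred from the list above, the load_source helper is hypothetical (it is not default_load_function from load_functions.py), and the URL, column names, and values are made up.

```python
import io

import pandas as pd

# Hypothetical source config, shown as an already-parsed dict.
config = {
    "fetch": {
        "source_url": "https://example.com/data.csv",  # placeholder URL
        "method": "AUTOMATIC_DOWNLOAD",
        "file": "data.csv",
    },
    "load": {
        "read": {"sep": ";"},  # forwarded to pandas.read_csv
        "dates": {"columns": ["day"], "date_format": "%d.%m.%Y"},
        # key: column in the locations file, value: column in the source data
        "regions": {"mapping_keys": {"region_name": "State"}},
    },
    # key: name in the data schema, value: column in the source data
    "data": {"hospitalized_current": "In hospital"},
}

# Hypothetical stand-ins for the locations file and a raw input file.
locations = pd.DataFrame({
    "open_covid_region_code": ["US-AL", "US-AK"],
    "region_name": ["Alabama", "Alaska"],
})
raw_csv = "day;State;In hospital\n15.08.2020;Alabama;7\n15.08.2020;Alaska;3\n"


def load_source(config, raw_text, locations):
    """Read raw text and map it to date, region code, and schema columns."""
    load = config["load"]
    df = pd.read_csv(io.StringIO(raw_text), **load["read"])

    # Dates: join the configured columns, parse them, and emit ISO 8601 strings.
    date_cfg = load["dates"]
    joined = df[date_cfg["columns"]].astype(str).apply(" ".join, axis=1)
    parsed = pd.to_datetime(joined, format=date_cfg["date_format"])
    df["date"] = parsed.dt.strftime("%Y-%m-%d")

    # Regions: look up the region code through the mapping keys.
    mapping = load["regions"]["mapping_keys"]
    loc_cols, src_cols = list(mapping.keys()), list(mapping.values())
    df = df.merge(
        locations[loc_cols + ["open_covid_region_code"]],
        left_on=src_cols,
        right_on=loc_cols,
        how="left",
    )

    # Data: rename source columns to their schema names.
    df = df.rename(columns={src: key for key, src in config["data"].items()})
    return df[["date", "open_covid_region_code"] + list(config["data"].keys())]


result = load_source(config, raw_csv, locations)
print(result)
```

The sketch uses open_covid_region_code as the output region column, matching the exported tables described earlier. Keeping source-specific details in the config is what lets one generic loader handle many sources.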
This repository is created and maintained by Katie Everett, Dan Nanas, Maddy Myers (UCSD), Sumit Arora, and Ian Fischer.
Source name: covid19data.com.au (link)
Link to data: https://www.covid19data.com.au/hospitalisations-icu
Description: Data is scraped manually from the charts provided at the source link. Data for Australia consists of time series data for current hospitalizations, ICU and ventilator cases.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-08-31
Source name: COVID-19 Tracking Project (link)
Link to data: https://github.com/COVID19Tracking/covid-tracking-data/tree/master/data
Description: Data is downloaded automatically from the source link. Data for the United States consists of time series data for current and cumulative hospitalizations.
License: Apache 2.0 (link)
Last accessed: 2020-09-01
Original data source: GOV.CO (link)
Link to original data: https://www.datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia/gt2j-8ykr/data
Data aggregated by: COVID-19 Colombia (link)
License: Creative Commons Attribution-ShareAlike 4.0 International (link)
Last accessed: 2020-09-01
Source name: National Health Information System, Regional Hygiene Stations, Ministry of Health of the Czech Republic (link)
Link to data: https://onemocneni-aktualne.mzcr.cz/covid-19
Description: Data is scraped manually from the charts provided at the source link. Data for the Czech Republic consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Citation:
Komenda M., Karolyi M., Bulhart V., Žofka J., Brauner T., Hak J., Jarkovský J., Mužík J., Blaha M., Kubát J., Klimeš D., Langhammer P., Daňková Š ., Májek O., Bartůňková M., Dušek L. COVID ‑ 19: Přehled aktuální situace v ČR. Onemocnění aktuálně [online]. Praha: Ministerstvo zdravotnictví ČR, 2020 [cit. 25.04.2020]. Dostupné z: https://onemocneni-aktualne.mzcr.cz/covid-19. Vývoj: společné pracoviště ÚZIS ČR a IBA LF MU. ISSN 2694-9423.
Last accessed: 2020-08-31
Source name: Statens Serum Institute (link)
Link to data: https://www.sst.dk/da/corona/tal-og-overvaagning
Description: Data is manually scraped from charts at the source link. Data for Denmark consists of time series data for current hospitalizations and ICU cases.
Last accessed: 2020-08-31
Source name: Finnish Institute for Health and Welfare (link)
Link to data: https://thl.fi/en/web/infectious-diseases/what-s-new/coronavirus-covid-19-latest-updates
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-08-31
Source name: data.gouv.fr (link)
Link to data: https://www.data.gouv.fr/en/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/
Description: Data is scraped manually from the charts provided at the source link. Data for France consists of time series data for cumulative hospitalizations and ICU cases.
License: Open License 2.0 (link)
Last accessed: 2020-09-01
Source name: Google's COVID19 Community Mobility Reports (link)
Link to data: https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv
Help Center: https://support.google.com/covid19-mobility
Description: These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:
Google LLC "Google COVID-19 Community Mobility Reports".
https://www.google.com/covid19/mobility/ Accessed: <date>.
Last accessed: 2020-08-28
Source name: Google's COVID19 Search Trends symptoms dataset (link)
Link to data: http://goo.gle/covid19symptomdataset
Description: The COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for symptoms, signs and some health conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:
Google LLC "Google COVID-19 Search Trends symptoms dataset".
http://goo.gle/covid19symptomdataset, Accessed: <date>.
Last accessed: 2020-08-30
Source name: Directorate of Health in Iceland (Embaetti landlaeknis) (link)
Link to data: https://www.covid.is/data
Description: Data is downloaded manually from the source link. Data for Iceland consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Last accessed: 2020-06-22
Source name: Health Protection Surveillance Centre (link)
Link to data: https://www.hpsc.ie/a-z/respiratory/coronavirus/novelcoronavirus/casesinireland/epidemiologyofcovid-19inireland/
Description: Data is scraped manually from daily situation reports. Data for Ireland consists of time series data for cumulative hospitalizations.
License: Creative Commons Attribution ShareAlike 3.0 (link)
Last accessed: 2020-08-31
Source name: Dipartimento della Protezione Civile (link)
Link to data: https://github.com/pcm-dpc/COVID-19
Description: Data is downloaded automatically from the source repository. Data for Italy consists of time series data for current hospitalizations, but we can also compute cumulative hospitalizations.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-09-01
Source name: Toyo Keizai Online (link)
Link to data: https://github.com/kaz-ogiwara/covid19
Copyright notice: Copyright (c) 2020 Kazuki OGIWARA / 荻原 和樹
Description: Data is downloaded automatically from the source repository. Data for Japan consists of time series data for current hospitalizations and ICU cases.
License: MIT (link)
Last accessed: 2020-08-03
Source name: Luxembourg Ministry of Health (link)
Link to data: https://data.public.lu/fr/datasets/donnees-covid19/#_
Description: Data is downloaded automatically from the source link. Data for Luxembourg consists of time series data for current hospitalizations and ICU cases.
License: Creative Commons Zero 1.0 Universal (link)
Last accessed: 2020-09-01
Source name: Ministry of Health, Labour and Social Protection (link)
Link to data: https://msmps.gov.md/ro/advanced-page-type/comunicate-de-presa
Last accessed: 2020-08-31
Source name: National Institute for Public Health and The Environment (link)
Link to data: https://www.rivm.nl/coronavirus-covid-19/grafieken
Description: Data is downloaded manually from the source link. Data for the Netherlands consists of time series data for current hospitalizations.
Last accessed: 2020-06-29
Source name: New Zealand Ministry of Health (link)
Link to data: https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/covid-19-current-situation/covid-19-current-cases
Last accessed: 2020-08-31
Source name: Norwegian Institute of Public Health (link)
Link to data: https://www.fhi.no/en/id/infectious-diseases/coronavirus/daily-reports/daily-reports-COVID19/
Last accessed: 2020-06-22
Source name: Our World in Data (link)
Link to data: https://github.com/owid/covid-19-data/tree/master/public/data
License: Creative Commons Attribution 4.0 International (link)
Citation:
Data from Our World in Data has been collected, aggregated, and documented by Diana Beltekian, Daniel Gavrilov, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie, and Max Roser.
Last accessed: 2020-09-01
Source name: Oxford Covid-19 Government Response Tracker (link)
Link to data: https://github.com/OxCGRT/covid-policy-tracker/blob/master/data/OxCGRT_latest.csv
License: Creative Commons Attribution 4.0 International (link)
Citation:
Thomas Hale, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. (2020). Oxford COVID-19 Government Response Tracker. Blavatnik School of Government.
Last accessed: 2020-09-01
Source name: Philippines Department of Health (link)
Link to data: http://www.doh.gov.ph/covid19tracker
Last accessed: 2020-08-31
Source name: Ministerio de Sanidad, Consumo y Bienestar Social (link)
Link to data: https://cnecovid.isciii.es/covid19/resources/agregados.csv
Description: The data is downloaded automatically from the source link. Due to regional differences in hospitalization reporting, we do not aggregate across regions to produce country-level statistics for Spain.
Last accessed: 2020-09-01
Source name: Public Health Agency of Sweden (link)
Link to data: https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data
Description: Data is downloaded automatically from the source link. Data for Sweden consists of time series data for current ICU cases.
Last accessed: 2020-09-01
Source name: Switzerland Federal Office of Public Health BAG (link)
Link to data: https://www.bag.admin.ch/bag/de/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/situation-schweiz-und-international.html
Last accessed: 2020-06-29
Source name: The New York Times COVID-19 Data (link)
Link to data: https://github.com/nytimes/covid-19-data
License: Creative Commons Attribution-NonCommercial 4.0 International (link)
Citation:
Data from The New York Times, based on reports from state and local health agencies.
Last accessed: 2020-09-01
Source name: GOV.UK (link)
Link to data: https://www.gov.uk/government/publications/
Description: Data is downloaded manually from the publications provided at the source link. Data is aggregated across regions in England and reported at the country level for England, Scotland, Wales and Northern Ireland. Data consists of time series data for current hospitalizations.
License: Open Government License 3.0 (link)
Last accessed: 2020-06-23