Repository extracting vaccination statistics from PDFs published by the Department of Health (Australia) in machine-readable formats

jxeeno/aust-govt-covid19-vaccine-pdf

Australian COVID-19 Vaccination Data

This repository is a mess of code which:

  1. Takes in the PDF file published by the Australian Department of Health
  2. Takes in national second dose data from the WA Health vaccination dashboard
  3. Converts it into machine-readable statistics (JSON and CSV files)
  4. Publishes data files via GitHub Actions and GitHub Pages

Looking for COVID-19 Case and Test Data? That data is in a separate repository: https://github.com/jxeeno/aust-govt-covid19-stats

Notes

Due to changes in the way vaccination data is reported throughout this year, some of the data may not be comparable. This section tries to summarise many of the data issues and reporting changes. It's lengthy -- you have been warned!

From 18 May 2021 to 1 July 2021

Prior to 18 May 2021, statistics about second doses (or the number of fully vaccinated people) were not available.

From 18 May 2021 to 1 July 2021, approximate second dose data by state of administration (data values in key format APPROX_<State>_SECOND_DOSE_TOTAL in all.csv) is derived from the WA Health Vaccination Dashboard, which is updated weekly, usually at the start of the week (the exact day varies).

The percentages are extracted from the WA Health dashboard and multiplied against the ABS 16 and over population data as noted in the WA Health interpretation guide.
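
As a rough illustration of that derivation, the sketch below multiplies a dashboard percentage by a 16-and-over population figure. The function name and the sample inputs are made up for illustration (they are not real WA Health or ABS values), and the rounding used by this repository's own code may differ.

```python
def approx_second_dose_total(pct_fully_vaccinated: float, population_16_plus: int) -> int:
    """Estimate a second dose count from a published percentage.

    pct_fully_vaccinated is a percentage (e.g. 12.5 means 12.5%).
    Rounding to the nearest whole person is an assumption of this sketch.
    """
    return round(pct_fully_vaccinated / 100 * population_16_plus)


# Hypothetical inputs, not real dashboard or ABS figures.
example = approx_second_dose_total(12.5, 2_000_000)
print(example)
```

Because the result is reconstructed from a rounded percentage, it should be treated as an approximation only.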

State-level second dose totals are not published daily and therefore do not necessarily line up with daily dose totals. This means that subtracting the approximate second dose total from total doses does not produce an accurate value for first doses.

From 1 July 2021 to 27 July 2021

The APPROX_<State>_SECOND_DOSE_TOTAL field in all.csv is deprecated and should not be used any more. From 17 August 2021, these fields will no longer be populated.

From 30 June 2021, the Department of Health started publishing breakdowns of doses by age group and first/second doses. This data is derived from the Australian Immunisation Register (AIR) and may not correspond directly with the headline figures, which include self-reported figures.

From 1 July 2021, the Department of Health started publishing breakdowns of doses by age group, first/second doses and by state of administration. This data is derived from the Australian Immunisation Register (AIR) and may not correspond directly with the headline figures, which include self-reported figures.

This data is published separately as air.csv and air.json files.

From 28 July 2021 to 15 August 2021

From 28 July 2021, the Department of Health started publishing breakdowns of doses by age group, first/second doses and by state of residence. This data is derived from the Australian Immunisation Register (AIR) and may not correspond directly with the headline figures, which include self-reported figures.

The switch from reporting state of administration to state of residence resulted in some states reporting a decrease in total number of doses. ACT and NT both reported drops.

Additional data points (doses by state of residence, by age group) were also published and are made available separately as air_residence.csv and air_residence.json files.

From 15 August 2021 onwards

From 15 August 2021, all data reported by the Department of Health is obtained from the Australian Immunisation Register (AIR). Previously, some statistics were based on self-reported figures from state-run clinics.

This reporting change resulted in negative daily dose counts for some states in the all.csv file. This drop is due to the lag between a dose being administered and the record being entered into AIR.

This change means that the statistics in the all.csv file prior to 15 Aug 2021 are not directly comparable to the data published on or after 15 Aug 2021.
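
If you compute daily changes from cumulative totals, it may help to flag the rows affected by this discontinuity. The sketch below is one way to do that; the function name, the record layout and the sample totals are all illustrative assumptions, not part of this repository.

```python
from datetime import date

AIR_SWITCH_DATE = date(2021, 8, 15)  # reporting change described above


def daily_changes(cumulative):
    """Turn (date, cumulative_total) pairs into per-day change records.

    Each record notes whether the change is negative and whether it spans
    the 15 Aug 2021 switch to AIR-only reporting.
    """
    ordered = sorted(cumulative)
    changes = []
    for (prev_day, prev_total), (day, total) in zip(ordered, ordered[1:]):
        changes.append({
            "date": day,
            "change": total - prev_total,
            "negative": total < prev_total,
            "spans_switch": prev_day < AIR_SWITCH_DATE <= day,
        })
    return changes


# Hypothetical cumulative totals for one state, for illustration only.
sample = [
    (date(2021, 8, 13), 1000),
    (date(2021, 8, 14), 1100),
    (date(2021, 8, 15), 1050),
    (date(2021, 8, 16), 1200),
]
result = daily_changes(sample)
```

A negative change that spans the switch is more likely a reporting artefact than a genuine correction, so you may prefer to exclude it from charts of daily doses.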

All data up to this point is based on date of reporting. Department of Health has also begun reporting doses on the date of administration. This data is not available in this repository yet.

From 16 August 2021 onwards

Vaccination rates for Aboriginal and Torres Strait Islander peoples (First Nations people) are now included in the dataset. This data was uploaded on 8 September, dating back to 16 August 2021.

Department of Health updates this data on a weekly basis, and it has been included in the daily vaccination data pack since 16 August 2021.

These appear as FIRST_NATIONS_<STATE|TERRITORY|AUS>_<FIRST|SECOND>_DOSE_TOTAL in all.csv and all.json.

From 6 September 2021 onwards

Department of Health no longer publishes the first and second dose and visit count breakdowns for aged and disability care.

This means the following fields are now deprecated:

  • CWTH_AGED_CARE_DOSES_FIRST_DOSE
  • CWTH_AGED_CARE_DOSES_SECOND_DOSE
  • CWTH_AGED_CARE_FACILITIES_FIRST_DOSE
  • CWTH_AGED_CARE_FACILITIES_SECOND_DOSE

From 13 September 2021 onwards

Department of Health is now publishing dose data for 12-15 year olds. This data is available in air.csv as:

  • AIR_12_15_<FIRST|SECOND>_DOSE_<COUNT|PCT>
  • AIR_<STATE>_12_15_<FIRST|SECOND>_DOSE_<COUNT|PCT>

From 15 September 2021 onwards

All totals now include the 12-15 age group.

From 7 November 2021 onwards

Department of Health is now publishing the number of individuals aged 16+ with 3 doses (or more) of COVID-19 vaccine recorded in AIR. This data is available in air.csv/air.json as:

  • AIR_AUS_16_PLUS_THIRD_DOSE_COUNT
  • AIR_AUS_16_PLUS_THIRD_DOSE_PCT

From 10 November 2021 onwards

Vaccination rates for Aboriginal and Torres Strait Islander peoples (First Nations people) are now also expressed as a percentage of population. This data was uploaded on 25 November, dating back to 10 November 2021.

Department of Health updates this data on a weekly basis, and it has been included in the daily vaccination data pack since 10 November 2021.

These appear as FIRST_NATIONS_<STATE|TERRITORY|AUS>_<FIRST|SECOND>_PCT_TOTAL in all.csv and all.json, in addition to existing FIRST_NATIONS_<STATE|TERRITORY|AUS>_<FIRST|SECOND>_DOSE_TOTAL which counts the total number of doses.

From 9 January 2022 onwards

Department of Health is now publishing the number of individuals aged 18+ with 3 doses (or more) of COVID-19 vaccine recorded in AIR. This data is available in air.csv/air.json as:

  • AIR_<STATE|TERRITORY|AUS>_18_PLUS_THIRD_DOSE_COUNT
  • AIR_<STATE|TERRITORY|AUS>_18_PLUS_THIRD_DOSE_PCT

Note: the key from the 7 November 2021 change was 16_PLUS for AUS. In this release, the key has changed to 18_PLUS in line with booster eligibility criteria. The 16_PLUS column for AUS is maintained, and a duplicate 18_PLUS column for AUS is also available for consistency.

*_16_PLUS_THIRD_DOSE_PCT uses the 16+ population as denominator, even though only the 18+ population is eligible. This field should be avoided and is here for backwards compatibility purposes. *_18_PLUS_THIRD_DOSE_PCT uses the 18+ population as denominator. This is the current eligible age group for boosters.
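
A consumer of air.csv could prefer the 18+ column and only fall back to the legacy 16+ column when needed, roughly as sketched below. The helper name is hypothetical, and whether the 18+ column is ever blank for older rows is an assumption of this sketch rather than something documented here.

```python
def third_dose_pct_aus(row: dict):
    """Return the national third dose percentage from one air.csv row.

    Prefers the 18+ denominator and falls back to the legacy 16+ column.
    Returns None when neither column holds a value.
    """
    for key in ("AIR_AUS_18_PLUS_THIRD_DOSE_PCT", "AIR_AUS_16_PLUS_THIRD_DOSE_PCT"):
        value = row.get(key)
        if value not in (None, ""):
            return float(value)
    return None


# Hypothetical rows, for illustration only.
newer_row = {"AIR_AUS_18_PLUS_THIRD_DOSE_PCT": "20.5", "AIR_AUS_16_PLUS_THIRD_DOSE_PCT": "19.9"}
older_row = {"AIR_AUS_18_PLUS_THIRD_DOSE_PCT": "", "AIR_AUS_16_PLUS_THIRD_DOSE_PCT": "1.5"}
```

Keep in mind the two columns use different denominators, so a series that mixes them is not strictly comparable over time.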

From 10 January 2022 onwards

Department of Health is now publishing the number of individuals aged 5-11 with at least 1 dose of COVID-19 vaccine recorded in AIR. This data is available in:

air.csv/air.json:

  • AIR_<STATE|TERRITORY|AUS>_5_11_FIRST_DOSE_COUNT
  • AIR_<STATE|TERRITORY|AUS>_5_11_FIRST_DOSE_PCT
  • AIR_<STATE|TERRITORY|AUS>_5_11_POPULATION

The following fields have been provisioned for, and will be available when the government publishes the data:

  • AIR_<STATE|TERRITORY|AUS>_5_11_SECOND_DOSE_COUNT
  • AIR_<STATE|TERRITORY|AUS>_5_11_SECOND_DOSE_PCT

For consistency, the following fields have also been added:

  • AIR_AUS_12_15_<FIRST|SECOND>_DOSE_<COUNT|PCT> (same value as AIR_12_15_<FIRST|SECOND>_DOSE_<COUNT|PCT>)

air_residence.csv/air_residence.json:

Where AGE_LOWER is 5 and AGE_UPPER is 11.

Note: Due to changes to the data layout, summary vaccination rates by jurisdiction for 50+ and 70+ are no longer available. I will add middleware in due course to estimate these figures.

From 28 Feb 2022 onwards

Department of Health is now publishing the percentage of eligible individuals who have received a third dose of COVID-19 vaccine by geographical region. This differs from the other percentage data, which is generally expressed relative to the population. To report against a common metric, we have estimated the actual number of people with at least 3 doses, and the percentage of population, by applying the eligible percentage to second dose counts from 12 weeks earlier (approx 3 months).

This data is available in:

  • air_<lga|sa4|sa3>.csv
  • air_<lga|sa4|sa3>.json

The following fields have been added:

  • AIR_THIRD_DOSE_ELIGIBLE_PCT - % of eligible people with at least three doses of vaccine, as provided by DOH
  • AIR_THIRD_DOSE_APPROX_COUNT - estimated number of people with at least three doses of vaccine (derived by AIR_THIRD_DOSE_ELIGIBLE_PCT * AIR_SECOND_DOSE_APPROX_COUNT from 12 weeks ago)
  • AIR_THIRD_DOSE_PCT - estimated % of population with at least three doses of vaccine (derived by AIR_THIRD_DOSE_APPROX_COUNT / ABS_ERP_2019_POPULATION)
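
The two derived fields above can be sketched as follows. This is a simplified illustration, not the repository's actual code: the function name and sample numbers are hypothetical, and it assumes AIR_THIRD_DOSE_ELIGIBLE_PCT is expressed on a 0-100 scale.

```python
def estimate_third_dose(eligible_pct, second_dose_count_12_weeks_ago, erp_population):
    """Estimate third dose count and percentage of population for a region.

    eligible_pct: AIR_THIRD_DOSE_ELIGIBLE_PCT, assumed to be on a 0-100 scale
    second_dose_count_12_weeks_ago: AIR_SECOND_DOSE_APPROX_COUNT from the
        data published 12 weeks earlier
    erp_population: ABS_ERP_2019_POPULATION for the same region
    """
    approx_count = eligible_pct / 100 * second_dose_count_12_weeks_ago
    pct_of_population = approx_count / erp_population * 100
    return round(approx_count), round(pct_of_population, 2)


# Hypothetical region: 50% of 50,000 eligible people, population 80,000.
count, pct = estimate_third_dose(50.0, 50_000, 80_000)
print(count, pct)
```

Since the eligible population is itself approximated by an older second dose count, both outputs are estimates.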

NOTE: The order of the columns in the CSV has changed to accommodate this additional data
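
One way to stay robust to column reordering is to read the CSV by header name rather than by position. A minimal sketch using Python's standard csv module follows; LGA_NAME and AIR_SECOND_DOSE_PCT are placeholder column names chosen for the example, and the values are made up.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into dicts keyed by header name, so column order is irrelevant."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Two hypothetical extracts: the second has an extra column in the middle.
old_layout = "LGA_NAME,AIR_SECOND_DOSE_PCT\nExample LGA,80\n"
new_layout = "LGA_NAME,AIR_THIRD_DOSE_PCT,AIR_SECOND_DOSE_PCT\nExample LGA,40,80\n"

old_value = read_rows(old_layout)[0]["AIR_SECOND_DOSE_PCT"]
new_value = read_rows(new_layout)[0]["AIR_SECOND_DOSE_PCT"]
```

Both lookups return the same value even though the column has moved from the second position to the third.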

Fixes:

  • Resolved an issue with booster by jurisdiction data being missing due to a change of layout
  • Fixed LGA matching issue with Nambucca Valley / Nambucca
  • Fixed LGA matching issue with DOH merged LGA: Mount Gambier (C) & Grant (DC)

Attribution

You must attribute the source of the data as Department of Health (all data except second doses by state prior to 1st July 2021) and WA Health (second dose by state data prior to 1st July 2021).

When using this data extract, I'd appreciate it if you attribute data extraction to myself (Ken Tsang) and link to this repository. This will be greatly appreciated, but not required.

Example:

Source: WA Health (second dose by state data prior to 1st July 2021) and Department of Health (all other data); Data extracted by Ken Tsang

Programmatic access to data

The data is also available at the following locations:

Daily dose administration data (and weekly second dose by state data from 18 May 2021 onwards)

Daily dose breakdown from AIR incl state of administration data (available from 30 June 2021 onwards)

Daily dose breakdown from AIR incl state of residence data (available from 28 July 2021 onwards)

Note: AIR residence data is normalised differently from the other data. Each row represents a separate day, state and age bucket. Counts are estimated by reverse calculating percentage and ABS Estimated Resident Population.
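
To make the long format concrete, the sketch below filters rows to one age bucket and reverse-calculates a count from a percentage and a population. Only AGE_LOWER and AGE_UPPER are column names mentioned in this README; STATE, FIRST_DOSE_PCT and POPULATION are placeholders for illustration, as are the values, so check the actual file header before relying on them.

```python
def rows_for_age_bucket(rows, age_lower, age_upper):
    """Select rows for one age bucket from long-format residence rows.

    Values are compared as strings because csv readers return strings and
    open-ended buckets may not hold a plain integer upper bound.
    """
    return [
        row for row in rows
        if row.get("AGE_LOWER") == str(age_lower)
        and row.get("AGE_UPPER") == str(age_upper)
    ]


def estimated_count(pct, population):
    """Reverse-calculate a head count from a percentage and a population."""
    return round(float(pct) / 100 * float(population))


# Hypothetical rows; column names other than AGE_LOWER/AGE_UPPER are placeholders.
sample_rows = [
    {"STATE": "NSW", "AGE_LOWER": "5", "AGE_UPPER": "11", "FIRST_DOSE_PCT": "25", "POPULATION": "100000"},
    {"STATE": "NSW", "AGE_LOWER": "12", "AGE_UPPER": "15", "FIRST_DOSE_PCT": "50", "POPULATION": "40000"},
]
children = rows_for_age_bucket(sample_rows, 5, 11)
children_count = estimated_count(children[0]["FIRST_DOSE_PCT"], children[0]["POPULATION"])
```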

Weekly dose distribution data

Index to raw data extracts

Programmatic access to geographical vaccination rates

Geographical vaccination rates are updated weekly.

Statistical Area 4

Vaccination rates by address of residence, grouped by ABS Statistical Area 4.

Statistical Area 3

Vaccination rates by address of residence, grouped by ABS Statistical Area 3. SA3s with less than 500 people aged 15 and over have been excluded.

Local Government Areas

Vaccination rates by address of residence, grouped by ABS Local Government Areas. LGAs with large ‘very remote’ and ‘remote’ areas, where geo-coding addresses is difficult, are excluded.

Statistical Area 4 (Indigenous population)

Vaccination rates of the Indigenous population by address of residence, grouped by ABS Statistical Area 4.

Note: ABS_ERP_2019_POPULATION represents the general estimated resident population for the SA4. AIR_INDIGENOUS_POPULATION is provided as an estimate of the Indigenous population in the SA4 based on records in the Australian Immunisation Register / Medicare. Percentages are calculated using AIR_INDIGENOUS_POPULATION as denominator.

By Postal Area (VIC only)

Vaccination rates by address of residence, grouped by ABS Postal Areas (POA). POAs with significant population change since the 2016 census are excluded as it is not possible to accurately provide vaccination rates.

Vaccination rates are expressed as percent ranges in 5% increments.

This data is obtained from https://www.coronavirus.vic.gov.au/weekly-covid-19-vaccine-data

Legacy feed for Statistical Area 4

This is the legacy SA4 feed for backwards compatibility. The data contained in this file is the same as the new SA4 data feed; however, the column names are slightly different as SA4-specific terminology has been removed. This legacy feed will continue to be updated. There is no need to switch to the new feed if you've already integrated against the legacy one.

Important note about data quality: This data is provided as-is. I'm not guaranteeing the timeliness or accuracy of any data provided above. Some basic validation steps are present (i.e. we test the data and see if the totals add up to expected values and if there are empty data values), but no manual checks are conducted. Use at your own risk.
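
If you want to run similar sanity checks on your side, something along these lines may be a starting point. It is not the validation code used in this repository; the function name and the keys in the sample rows are hypothetical.

```python
def validate_row(row, part_keys, total_key):
    """Return a list of problems found in one row (an empty list means it passed).

    Checks that no listed field is empty and that the parts add up to the total.
    """
    problems = []
    for key in (*part_keys, total_key):
        if row.get(key) in (None, ""):
            problems.append(f"empty value: {key}")
    if not problems:
        parts_sum = sum(int(row[key]) for key in part_keys)
        expected = int(row[total_key])
        if parts_sum != expected:
            problems.append(f"parts sum to {parts_sum}, expected {expected}")
    return problems


# Hypothetical rows with placeholder keys.
good_row = {"A": "1", "B": "2", "TOTAL": "3"}
bad_sum_row = {"A": "1", "B": "2", "TOTAL": "4"}
empty_row = {"A": "", "B": "2", "TOTAL": "2"}
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue in a single pass.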

The data files above are usually updated daily. GitHub Actions is configured to scrape and extract data from the Department of Health website every 5 minutes and publish it via GitHub Pages. The data is also available in this git repo under docs/data.

Documentation for these data files will come in due course.

Want to use this in Google Sheets?

You can use the =IMPORTDATA() formula:

=IMPORTDATA("https://vaccinedata.covid19nearme.com.au/data/all.csv")
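
Outside of Google Sheets, the same file can be loaded with a few lines of Python using only the standard library. This is a minimal sketch: the helper names are made up, and it assumes the file is UTF-8 encoded with a header row.

```python
import csv
import io
import urllib.request

ALL_CSV_URL = "https://vaccinedata.covid19nearme.com.au/data/all.csv"


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(url=ALL_CSV_URL, timeout=30):
    """Download a CSV file and parse it (UTF-8 encoding is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_csv(response.read().decode("utf-8"))


# Usage (performs a network request, so it is left commented out here):
# rows = fetch_rows()
# print(len(rows), "rows")
```

All values come back as strings, so convert them to numbers before doing arithmetic.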

To run yourself

You can also run this code yourself. You'll need:

  • Yarn (or NPM) to install JS dependencies
  • Node (not sure what version but I'm running v12.x)

git clone https://github.com/jxeeno/aust-govt-covid19-vaccine-pdf.git
cd aust-govt-covid19-vaccine-pdf
yarn # or: npm install

# prior to 16 Aug 2021
node index.js "https://www.health.gov.au/sites/default/files/documents/2021/04/covid-19-vaccine-rollout-update-19-april-2021.pdf"

# from 16 Aug 2021 onwards
node index.js "https://www.health.gov.au/sites/default/files/documents/2021/08/covid-19-vaccine-rollout-update-19-august-2021.pdf" "https://www.health.gov.au/sites/default/files/documents/2021/08/covid-19-vaccine-rollout-update-jurisdictional-breakdown-19-august-2021.pdf"

Help

Why did you build this?

Because for some reason, our Health department reckons the best way to provide statistical data is in a PDF file generated from Microsoft PowerPoint.

This data should be available in machine-readable formats for transparency and to enable ease of access.

Oh no, it's broken

Yeah, that's probably going to happen. Every time the Health department decides to add some new disclaimers or tweak the layout/wording a little, this thing will break.

You can try and fix it and submit a PR. Or raise an issue and I'll have a look at it.

Why is the code so bad?

Yeah, it's spaghetti code because it's basically disposable code. I expect to need to rewrite this every few days.

Having said that, you're welcome to raise a PR if you want to make it better! :)

Data corrections

Department of Health occasionally updates their historical vaccine statistics to correct incorrect data. See below for summary of changes:

Date Description of changes
2021-05-13 NSW total revised from 266,514 (+10,321) to 264,135 (+7,942)
2021-05-13 National total revised from 2,980,644 (+85,874) to 2,978,265 (+83,495)
2021-05-14 National 24 hour difference revised from +76,153 to +78,532
2021-05-14 NSW 24 hour difference revised from +8,237 to +10,616
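
The four corrections above are consistent with each other: the same number of doses is removed from the 13 May totals and added to the 14 May daily differences. The short check below only uses figures copied from the table; the variable names are illustrative.

```python
# Figures copied from the corrections table above.
nsw_total_before, nsw_total_after = 266_514, 264_135
national_total_before, national_total_after = 2_980_644, 2_978_265
national_diff_before, national_diff_after = 76_153, 78_532
nsw_diff_before, nsw_diff_after = 8_237, 10_616

# Doses removed from the 13 May cumulative totals.
nsw_total_shift = nsw_total_before - nsw_total_after
national_total_shift = national_total_before - national_total_after

# Doses added to the 14 May 24-hour differences.
national_diff_shift = national_diff_after - national_diff_before
nsw_diff_shift = nsw_diff_after - nsw_diff_before

print(nsw_total_shift, national_total_shift, national_diff_shift, nsw_diff_shift)
```

All four values come out to 2,379, which suggests the revision moved those NSW doses from 13 May to 14 May rather than changing the overall count.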
