Release NCHS Mortality #367

krivard · 2020-10-26T20:49:50Z

Link to issue
Link to PR
Proposed release version: 1.11

This is a new indicator that's completed its first phase of development and is ready to consider for public release. This indicator tracks the number of covid, pneumonia, and all deaths, as well as the percentage of expected deaths, as published by the NCHS. Signal is per week, but the estimate is updated daily.

Edit: we will not pursue map release at this time

- Statistical review (usually correlations)
- Signal / source name review (usually Roni)
- Add to Delphi Automation
- API support for weekly data
- Visual review
- ~~Signal description pop-up text drafting and review~~
- ~~Map release notes~~
- API documentation and/or changelog
- API mailing list notification

benjaminysmith · 2020-10-27T13:54:18Z

Thanks Katie -- would you like me to shepherd the sub-tasks? Do you have an example of the process I can copy?

krivard · 2020-10-27T14:03:00Z

I would! Some examples are in the links, but this is still a prototype of the process we'd like to build, so not everything has something to go off of.

The tasks are roughly in order -- eg if the statistical review doesn't go off, we don't want to move forward until that's been fixed.

Jingjing should handle the statistical review. Anyone can handle putting together the signal and source naming recommendations (just send Roni a txt when he should look at the approvals doc and when it's needed by -- 1.11 isn't on the calendar yet so several days allowance is fine). Adding to automation will coordinate with Brian. Visual review, signal description pop-up text, and map release notes will coordinate with Chris. API documentation and mailing list notification will coordinate with Kari, Jingjing, and Alex.

benjaminysmith · 2020-10-27T14:05:55Z

Fabulous thanks!

@jingjtang looks like you are up for the first task. Ping me if you need help and I can escalate.

jingjtang · 2020-10-27T17:49:48Z

@benjaminysmith Here is a simple comparison between usa-facts deaths_incidence_num and nchs wip_covid_deaths_num. They are generally consistent with each other, but differences exist, and the extent of the difference varies among states.
covid_deaths_num.pdf

As for the correlation app, there does not seem to be a weekly response to use. A simple geo-wise Spearman correlation analysis is shown below. Note that no lag is considered (based on the figures in the PDF). The drop in the last few weeks is reasonable, since death reports usually have significant delays. According to NCHS, this delay can range from 1 week to 8 weeks or more, depending on the jurisdiction and cause of death.
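For readers who want to reproduce this kind of check, here is a minimal sketch of a geo-wise Spearman correlation: for each week, rank the states by each signal and correlate the ranks. It is not the code used for the plot. The function names, the dictionary layout, and the toy numbers are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho as the Pearson correlation of ranks; None if undefined."""
    if len(xs) != len(ys) or len(xs) < 2:
        return None
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None  # one signal is constant across states
    return cov / sqrt(vx * vy)


def geowise_spearman(signal_a, signal_b):
    """For each week, correlate two signals across the states both report."""
    result = {}
    for week in sorted(set(signal_a) & set(signal_b)):
        states = sorted(set(signal_a[week]) & set(signal_b[week]))
        xs = [signal_a[week][s] for s in states]
        ys = [signal_b[week][s] for s in states]
        result[week] = spearman(xs, ys)
    return result


# Toy numbers only (hypothetical), keyed by week and then by state.
nchs = {"w20": {"WA": 30, "NY": 900, "LA": 120, "AK": 0}}
usafacts = {"w20": {"WA": 35, "NY": 950, "LA": 110, "AK": 1}}
rho_by_week = geowise_spearman(nchs, usafacts)
```

Because only ranks matter, the toy week above gives a perfect correlation even though the counts differ, which is why Spearman suits a consistency check between two sources with different reporting conventions.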

jingjtang · 2020-10-27T17:55:06Z

@krivard As for the signal names, the current ones are:

covid_deaths_num
covid_deaths_prop
total_deaths_num
total_deaths_prop
influenza_deaths_num
influenza_deaths_prop
pneumonia_deaths_num
pneumonia_deaths_prop
pneumonia_and_covid_deaths_num
pneumonia_and_covid_deaths_prop
pneumonia_influenza_or_covid_19_deaths_num
pneumonia_influenza_or_covid_19_deaths_prop
percent_of_expected_deaths

as described here

what kind of txt do you need?

jingjtang · 2020-10-27T17:58:10Z

Although no geo aggregation is needed, we need population info for the prop signals, so we still need the geo refactor #382. However, the raw dataset uses state names (e.g. Alabama, Alaska), which our geo utils have not supported. Need @dshemetov's confirmation. (Edited: a new PR was merged yesterday; state names are now supported in geo utils.)

jingjtang · 2020-10-27T18:07:09Z

Contacted @korlaxxalrok already for the automation. Will work with him once he starts on this.
As for the automation, some important points for this pipeline:

- A token is needed. Either use mine or create another one for Delphi.
- Code related to export_start_date should be changed according to the final strategy.
- Data versioning stuff (run and store diffs daily but export to the API weekly) might need @eujing's help.

RoniRos · 2020-10-27T18:32:23Z

@jingjtang :

Is there significance to the use of 'and' vs. 'or' in the signal names above? Does pneumonia_and_covid_deaths_num count deaths that have been classified as due to BOTH pneumonia AND Covid, or those that have been classified as due to pneumonia + those classified as due to Covid?
In this description, it's important to clearly mention whether the classification is based only on the primary ICD code (primary cause of death), or on 'any code'.

jingjtang · 2020-10-27T19:04:35Z

@RoniRos

If we treat deaths caused by influenza, pneumonia, and covid as different sets, 'and' means intersection while 'or' means union. So,
- pneumonia_and_covid_deaths means deaths caused by BOTH pneumonia AND covid.
- pneumonia_influenza_or_covid_19_deaths means deaths caused by ANY of influenza, pneumonia, or covid.

Based on the description on their website, I think they are classified based on 'any code' (not sure whether it is the primary cause):
> COVID-19 deaths are identified using a new ICD–10 code. When COVID-19 is reported as a cause of death – or when it is listed as a “probable” or “presumed” cause — the death is coded as U07.1. This can include cases with or without laboratory confirmation.
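To make the and/or reading concrete, here is a small sketch using Python sets. The death records and condition labels are hypothetical; the point is only that 'and' is a set intersection and 'or' is a set union when a death can list several causes.

```python
# Hypothetical records: death ID -> all conditions listed on the certificate.
deaths = {
    "d1": {"covid"},
    "d2": {"covid", "pneumonia"},
    "d3": {"influenza"},
    "d4": {"pneumonia"},
    "d5": {"other"},
}


def with_condition(condition):
    """IDs of deaths whose certificate lists the condition under any code."""
    return {death_id for death_id, conds in deaths.items() if condition in conds}


covid = with_condition("covid")
pneumonia = with_condition("pneumonia")
influenza = with_condition("influenza")

# 'and' -> intersection: both conditions on the same certificate.
pneumonia_and_covid = pneumonia & covid

# 'or' -> union: at least one of the conditions on the certificate.
pneumonia_influenza_or_covid = pneumonia | influenza | covid
```

Note that a union counts death d2 once, so the 'or' signal is not the sum of the three single-condition counts.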

jingjtang · 2020-10-29T16:51:01Z

@korlaxxalrok for automation

benjaminysmith · 2020-10-29T18:04:37Z

@RoniRos -- is the clarification on naming acceptable, or would you like to change the AND/OR use?

@krivard is there an approver for the correlations above, or is this enough?

jingjtang · 2020-10-29T22:07:11Z

API documentation PR here

benjaminysmith · 2020-11-05T02:03:26Z

The correlation looks pretty convincing to me. @RoniRos can give final approval on this, but otherwise this looks good to go.

If the names are also good I think we have everything we need aside from running it in automation.

jingjtang · 2020-11-06T00:24:24Z

Visual View for all of the signals here

RoniRos · 2020-11-06T03:37:52Z

@benjaminysmith I'm very sorry this has been blocked on me. I needed to find time to review the NCHS data definitions.

RoniRos · 2020-11-06T04:18:10Z

Based on the NCHS explanations and our conventions, the signals need to be named as follows:

deaths_covid_cumulative_num
deaths_covid_cumulative_prop
deaths_allcause_cumulative_num
deaths_allcause_cumulative_prop
deaths_flu_cumulative_num
deaths_flu_cumulative_prop
deaths_pneumonia_butnotflu_cumulative_num
deaths_pneumonia_butnotflu_cumulative_prop
deaths_covid_and_pneumonia_butnotflu_cumulative_num
deaths_covid_and_pneumonia_butnotflu_cumulative_prop
deaths_pneumonia_or_flu_or_covid_cumulative_num
deaths_pneumonia_or_flu_or_covid_cumulative_prop
deaths_percent_of_expected

By way of explanation: it turns out the classifications are _not_ based on the primary cause of death, but on any cause of death, of which there are often several (several ICD codes per individual death). So the logic has to be spelled out more clearly.

Some choices can be debated. I'd love to hear your thoughts @krivard @benjaminysmith @jingjtang :

- I started with 'deaths' because that's what we do in the existing Covid death signals (although we don't use 'covid' in their name, it is implied).
- The example on the CDC website shows cumulative counts, so I made it 'cumulative'. If the signal is not cumulative, then 'cumulative' should be replaced by 'incidence' (or do you provide both signals?).
- 'total' is too generic a term - there are many ways of deriving totals. The technical term is 'all cause', which we could spell as 'all_cause' or 'allcause' or even 'all-cause' (@krivard do we ever use a hyphen in signal names?).
- 'flu' and 'influenza' are generally equivalent. In formal writing, 'influenza' is usually preferred, because 'flu' in general public usage is used for lots of other things (e.g. "stomach flu", which has nothing to do with influenza). However, these names are awfully long already, so I think it is acceptable and indeed preferable to use 'flu'.
- "butnotflu" could be spelled "but_not_flu", if you prefer. Again, I was going for simplicity.

RoniRos · 2020-11-06T04:20:13Z

> In this description, it's important to clearly mention whether the classification is based only on the primary ICD code (primary cause of death), or on 'any code'.

@jingjtang please modify the description to explicitly state that the classification is based on all the codes on the death certificate (not just the 'primary cause of death').

RoniRos · 2020-11-06T04:21:53Z

> The correlation looks pretty convincing to me. @RoniRos can give final approval on this,

@benjaminysmith Can you please point me to the correlations?

jingjtang · 2020-11-06T04:25:08Z

@RoniRos They should be incidence not cumulative. You can see a simple comparison between the NCHS covid deaths_num and usa-facts deaths_incidence_num here. Currently we only provide incidence, we can add cumulative ones if we want.

RoniRos · 2020-11-06T04:30:32Z

> @RoniRos They should be incidence not cumulative. You can see a simple comparison between the NCHS covid deaths_num and usa-facts deaths_incidence_num here. Currently we only provide incidence, we can add cumulative ones if we want.

Ok, thanks! Then let's change to 'incidence' throughout.

Yes, I saw this plot, I thought Ben was referring to additional ones.
I'd love to visualize all these signals before we make them public. How can I do that?
We used to be able to call the COVIDcast map with the wip_ signals in a JSON string... I don't think that still works though.
I suppose if you can send me csv files for the most recent issue, that will do.

jingjtang · 2020-11-06T04:34:08Z

@RoniRos This is not automated yet and the signals cannot be fetched from our API. I am not sure what the best way to show them visually is. Here is a notebook that I created for you to play with.

benjaminysmith · 2020-11-13T19:03:47Z

@jingjtang -- what do we need to do to resolve this? Do you need additional input?

RoniRos · 2020-11-14T21:59:18Z

Thanks @jingjtang .

I wasn't concerned about reporting delays, nor about the explanation being inconsistent. I was referring to the observation that in your geo-wise correlation plot:

The correlation is appropriately high (almost 1.0) for most of the weeks, but not for the first few weeks (w11-w12 and maybe w13). To me this suggests that either the NCHS or the USAFacts signal has wrong data for these weeks, and I was asking which. I don't think that reporting delays in NCHS make a difference here, because by now there are no more updates to NCHS reports for w11-14. And NCHS data is based on the date of death, not the date of reporting. (@jingjtang : Does USAFacts report deaths by date of death or by the date the death was reported?)

On inspecting your comparison chart (covid_deaths_num.pdf), I think I figured out the reason. There are some surprising zeros in the NCHS counts. Specifically:

- Alaska (AK) is all zeros, even though Table 2 here shows a cumulative 74 COVID deaths.
- WY is zero thru w42 (!)
- HI is zero thru w33
- MT is zero thru w31
- Quite a few states are zero thru w14 or w15, even though USAFacts reports cases

So @benjaminysmith I think my question about the glitchy correlation is resolved: the problem is NCHS, not USAFacts. But this is probably something we should try to resolve. For example, why does our NCHS signal report all zeros for AK when the CDC table shows a cumulative 74 deaths? And do the prolonged zeros for HI, MT & others correspond to CDC's reporting?

krivard · 2020-11-16T16:17:56Z

Is it likely that this is related to the file format problems with the early NCHS data files dated up to ~May? I inquired with Matthew but never got a response, and we eventually just went forward with the better file format that became available starting June. (Sparse) context in this thread.

RoniRos · 2020-11-16T18:39:02Z

The weak correlations for w11-13 may be explained that way, but not the zeros for AK, WY, HI, MT. I think we need to resolve this before we publish -- we shouldn't publish zeros for AK, for example.
@jingjtang can you please investigate why e.g. AK is still showing zeros, even though the NCHS table shows a cumulative 74 deaths? Please point me to the location in NCHS where the zeros originate.
@krivard by Matthew do you mean Matt Biggerstaff at CDC? I now have a more direct connection to NCHS, so I could try with them, but before I do I want to make sure I have all the facts right.

jingjtang · 2020-11-16T19:54:10Z

@RoniRos Just fixed a bug in the pipeline where I had incorrectly applied a one-week shift (#528). After fixing this issue:

The current comparison is shown like this
compare_with_usafacts.pdf
The geo-wise correlation is shown like this
Fixed state-level visual view

So according to the comparison with incidence numbers in USA-Facts, we should treat their (NCHS's) weekly reports as incidence numbers. There are NAs in the raw dataset. I treated them as 0 when I initially wrote the pipeline, but that is incorrect, especially for states with many missing values such as "AK", "DE", "HI", "ME", "NC", "WY", whether for a certain time range or for the entire published time range. I should definitely keep those NAs.
If we keep all of the NAs, and drop regions where either signal (nchs_mortality deaths_covid_num, usa-facts deaths_incidence_num) has an NA when calculating the Spearman correlation, the result looks like this

In W11, NCHS mortality reports deaths only for WA (27), while USA-Facts reports deaths only for WA (24) and LA (1).

It's correct that they have 74 cases in Table 2 for AK, but it is also correct that they only ever have 0 or NA for AK in Table 1 (a similar thing happens for WY: the first non-zero/non-NA value shows up in W42, with week-ending date 2020-10-17). @RoniRos You can directly download the most recent raw dataset from here (click "export" and then "CSV"). So it seems they have 74 recorded COVID-related deaths through 11-14 but are not sure which week to assign those deaths to.
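As an illustration of keeping NAs instead of coercing them to 0, here is a minimal sketch, not the pipeline code. The helper names and the set of missing-value markers are assumptions, and apart from the W11 figures quoted in this comment (WA 27 vs. 24, LA 1) the values are made up.

```python
# Assumed spellings of a missing cell in the raw export (hypothetical).
MISSING_MARKERS = {"", "NA", "N/A", "NAN"}


def parse_count(raw):
    """Return an int count, or None when the raw cell is missing."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.upper() in MISSING_MARKERS:
        return None  # keep the gap; do not turn it into a zero
    return int(text)


def complete_pairs(nchs_week, usafacts_week):
    """Keep only states where both signals have a value for the week."""
    pairs = []
    for state in sorted(set(nchs_week) & set(usafacts_week)):
        a, b = nchs_week[state], usafacts_week[state]
        if a is not None and b is not None:
            pairs.append((state, a, b))
    return pairs


raw_nchs = {"WA": "27", "AK": "", "WY": "NA", "LA": "0"}
nchs_week = {state: parse_count(value) for state, value in raw_nchs.items()}
usafacts_week = {"WA": 24, "LA": 1, "AK": 0, "WY": 0}

pairs = complete_pairs(nchs_week, usafacts_week)
```

Only the surviving pairs would then go into the rank correlation, so states with missing values neither inflate nor deflate it.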

RoniRos · 2020-11-17T04:08:33Z

I think I figured out the reason for the discrepancy: any time the value is missing (N/A), it corresponds to a positive count less than 10 (i.e., somewhere in 1-9), which is censored by privacy rules. For low-population, low-covid states like Alaska or Wyoming, this happens most of the time. For some of the other states, it happens occasionally.
We need to figure out how to handle this. These counts are not completely missing -- they may be reconstructed, at least partially, from NCHS cumulative values (Table 2, which gave us 74 cases for AK). Have we been storing prior versions of Table 2? @brookslogan @jingjtang ?
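A rough sketch of what partial reconstruction could look like, under strong assumptions: every missing weekly value is a censored count between 1 and 9, and the cumulative total covers exactly the same weeks and data version as the weekly series. The function name and the example series are hypothetical.

```python
def censored_bounds(weekly, cumulative_total, low=1, high=9):
    """Bound each censored week given a cumulative total.

    weekly: list of ints, with None marking a censored week (assumed low..high).
    Returns (remainder, per_week_low, per_week_high), where remainder is the
    number of deaths that must fall in the censored weeks combined.
    """
    known = sum(v for v in weekly if v is not None)
    k = sum(1 for v in weekly if v is None)
    remainder = cumulative_total - known
    if not (k * low <= remainder <= k * high):
        raise ValueError("cumulative total is inconsistent with the censoring rule")
    if k == 0:
        return (0, None, None)
    # The other k-1 censored weeks take at most `high` each, so one week gets
    # at least remainder - high*(k-1); symmetrically for the upper bound.
    per_week_low = max(low, remainder - high * (k - 1))
    per_week_high = min(high, remainder - low * (k - 1))
    return (remainder, per_week_low, per_week_high)


# Hypothetical series: three censored weeks, one zero, one published count.
bounds = censored_bounds([None, 0, None, 12, None], cumulative_total=38)
```

In this toy case the three censored weeks must sum to 26, which forces each of them to be 8 or 9. With a smaller total the bounds fall back to the uninformative 1 to 9, so the approach only helps when the totals are tight.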

jingjtang · 2020-11-17T04:52:30Z

Didn't find a download link for Table 2, but here is a county-level table for COVID-related deaths only. According to their API docs, we should be able to specify data_as_of; however, I cannot fetch old versions of the data. (I changed the data_as_of in https://data.cdc.gov/resource/kn79-hsxy.json?data_as_of=2020-11-12T00:00:00.000, but got nothing.) @brookslogan Have you played with the Socrata API to get old versions of data?

jingjtang · 2020-11-17T19:07:59Z

@benjaminysmith Draft for the signal description pop-up text added under Name: Percent of Expected Deaths. Since this is our first weekly signal, it will affect some of our map settings.

krivard · 2020-11-17T19:57:43Z

When I looked at Socrata before, data_as_of did nothing. Here's a Slack thread where Logan explains that it's not a first-class feature of their API.

I'd forgotten about the frontend support for weeklies 😱 we may end up pushing this to 1.12 in the map, but at least we can release in the API with the current batch.

benjaminysmith · 2020-11-18T01:45:35Z

Summarizing the last few comments:

We are not currently able to reconstruct the data because of the API limitations. It sounds like the default way for us to handle these for now is to fill these with N/A. @jingjtang is that an accurate assessment? If that is the simplest and we still think the data is useful, we could start with this and follow up to see if we can reconstruct later.
We will need to follow up with map support for weekly data. @tildechris as FYI.

capnrefsmmat · 2020-11-18T15:56:48Z

Will this signal have time_type = "week" in the API? If so, we'll need to update the API clients to support it as well, and I'll need to file an issue for that.

jingjtang · 2020-11-18T16:01:32Z

@benjaminysmith Yes. Since we currently don't have another data source that can fill in the values missing due to privacy rules, we want to keep them as NaN, at least for now, I think. @krivard
There is a PR here for this.

tildechris · 2020-11-18T18:22:15Z

Regarding map support, we'd need to know an ETA of when this would be available in staging to schedule the work to pull it in.

benjaminysmith · 2020-11-18T20:13:10Z

API support sounds like a blocker for release. Filed a request for this in cmu-delphi/covidcast#305

krivard · 2020-11-19T16:31:32Z

*covidcast client support.

The API server has no problem with a weekly time type, and neither do the low-level clients in delphi-epidata. This may be a topic for team leads this afternoon.

RoniRos · 2020-11-19T20:37:46Z

Representing the censored values as N/A seems like the most reasonable thing to do for now. Eventually, we may want to distinguish them from truly missing. Do our tools support different types of missingness?

RoniRos · 2020-11-19T20:39:19Z

Also, we need to think of how to represent on the map zero vs. censored vs. unknown.

krivard · 2020-11-19T20:52:32Z

Verdict from leads: Go ahead with the release in the API; the client and map will catch up.

@RoniRos We do not yet encode different kinds of missingness; it's on the list for December.

korlaxxalrok · 2020-12-08T17:40:46Z

@krivard This is on prod now. It should have delivered some data yesterday, but otherwise will run daily, build up its cache, and then output on Mondays.

krivard · 2020-12-08T22:52:19Z

The API docs got lost along the way but I recovered them in cmu-delphi/delphi-epidata#315.

@jingjtang would you draft a mailing list announcement and drop it in a comment here? Then @Akvannortwick can polish it up for distribution and maybe we can announce this puppy tomorrow.

Akvannortwick · 2020-12-08T23:02:59Z

You can also draft an email here and I will make changes and suggestions.

jingjtang · 2020-12-09T15:01:14Z

@krivard @Akvannortwick Besides the descriptions for all related signals, is there anything else to be added?

Also, I made some comments under cmu-delphi/delphi-epidata#315

krivard · 2020-12-10T22:13:56Z

Released today!

krivard added the release Track the finishing work for features ready for release label Oct 26, 2020

krivard assigned jingjtang and benjaminysmith Oct 26, 2020

benjaminysmith mentioned this issue Oct 30, 2020

Release 7-day average signals for Safegraph #332

Closed

7 tasks

krivard assigned Akvannortwick Oct 30, 2020

krivard mentioned this issue Oct 30, 2020

add docs for nchs-mortality cmu-delphi/delphi-epidata#264

Closed

3 tasks

krivard added this to the 1.11 milestone Oct 30, 2020

jingjtang mentioned this issue Nov 17, 2020

Fix nchs missing value #535

Merged

benjaminysmith mentioned this issue Nov 18, 2020

Need support for requesting weekly data to R and Python clients for safegraph patterns cmu-delphi/covidcast#305

Closed

SumitDELPHI added the Engineering Used to filter issues when synching with Asana label Dec 6, 2020

korlaxxalrok mentioned this issue Dec 7, 2020

Release nchs_mortality to production #605

Merged

krivard closed this as completed Dec 10, 2020

nmdefries mentioned this issue Apr 26, 2024

nssp pipeline code #1952

Merged
