Release NCHS Mortality #367

Closed
6 of 9 tasks
krivard opened this issue Oct 26, 2020 · 50 comments

@krivard
Contributor

krivard commented Oct 26, 2020

This is a new indicator that's completed its first phase of development and is ready to consider for public release. This indicator tracks the number of covid, pneumonia, and all deaths, as well as the percentage of expected deaths, as published by the NCHS. Signal is per week, but the estimate is updated daily.

Edit: we will not pursue map release at this time

@krivard krivard added the release Track the finishing work for features ready for release label Oct 26, 2020
@benjaminysmith
Contributor

Thanks Katie -- would you like me to shepherd the sub-tasks? Do you have an example of the process I can copy?

@krivard
Contributor Author

krivard commented Oct 27, 2020

I would! Some examples are in the links, but this is still a prototype of the process we'd like to build, so not everything has something to go off of.

The tasks are roughly in order -- eg if the statistical review doesn't go off, we don't want to move forward until that's been fixed.

Jingjing should handle the statistical review. Anyone can handle putting together the signal and source naming recommendations (just send Roni a txt when he should look at the approvals doc and when it's needed by -- 1.11 isn't on the calendar yet so several days allowance is fine). Adding to automation will coordinate with Brian. Visual review, signal description pop-up text, and map release notes will coordinate with Chris. API documentation and mailing list notification will coordinate with Kari, Jingjing, and Alex.

@benjaminysmith
Contributor

Fabulous thanks!

@jingjtang looks like you are up for the first task. Ping me if you need help and I can escalate.

@jingjtang
Contributor

jingjtang commented Oct 27, 2020

@benjaminysmith Simple comparison between usa-facts deaths_incidence_num and nchs wip_covid_deaths_num. They are generally consistent with each other, but differences exist. The extent of the difference varies among states.
covid_deaths_num.pdf

As for the correlation app, there does not seem to be a weekly response signal to use. A simple geo-wise Spearman correlation analysis is shown below. Note that no lag is considered (per the figures in the PDF). The drop in the last few weeks is expected, since death reports usually have a significant delay. According to NCHS, this delay can range from 1 week to 8 weeks or more, depending on the jurisdiction and cause of death.
image
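
For readers who want to reproduce this kind of check, here is a minimal sketch of a geo-wise Spearman correlation: for each week, the two signals are rank-correlated across states. The function names, the nested-dict layout, and the numbers are assumptions made for illustration; this is not the code behind the plot above.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def geo_wise_spearman(signal_a, signal_b):
    """For each week, correlate the two signals across states present in both."""
    result = {}
    for week in sorted(set(signal_a) & set(signal_b)):
        states = sorted(set(signal_a[week]) & set(signal_b[week]))
        a = [signal_a[week][s] for s in states]
        b = [signal_b[week][s] for s in states]
        result[week] = spearman(a, b)
    return result


# Made-up numbers, for illustration only.
nchs = {"w20": {"PA": 120, "NY": 300, "WA": 40, "TX": 90}}
usafacts = {"w20": {"PA": 110, "NY": 280, "WA": 55, "TX": 95}}
correlations = geo_wise_spearman(nchs, usafacts)
```

Because only ranks matter, a week where both sources order the states the same way scores 1.0 even if the raw counts differ.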

@jingjtang
Contributor

jingjtang commented Oct 27, 2020

@krivard As for the signal names, the current ones are:

  • covid_deaths_num
  • covid_deaths_prop
  • total_deaths_num
  • total_deaths_prop
  • influenza_deaths_num
  • influenza_deaths_prop
  • pneumonia_deaths_num
  • pneumonia_deaths_prop
  • pneumonia_and_covid_deaths_num
  • pneumonia_and_covid_deaths_prop
  • pneumonia_influenza_or_covid_19_deaths_num
  • pneumonia_influenza_or_covid_19_deaths_prop
  • percent_of_expected_deaths

as described here

what kind of txt do you need?

@jingjtang
Contributor

jingjtang commented Oct 27, 2020

No geo aggregation is needed, but we do need population info for the prop signals, so we still need the geo refactor #382. However, the raw dataset uses state names (e.g. Alabama, Alaska), which our geo utils have not supported. Need @dshemetov's confirmation. (Edited: a new PR was merged yesterday; state names are now supported in geo utils.)
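
A rough sketch of why population is needed for the prop signals. It assumes "prop" means a rate per 100,000 people, as in other COVIDcast signals; this thread does not state that. The lookup tables and population figures are placeholders, not real data or the geo utils API.

```python
def deaths_per_100k(count, population):
    """Convert a raw count to a rate per 100,000 people; missing stays missing."""
    if count is None or population is None:
        return None
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical lookup keyed by the full state names used in the raw dataset.
STATE_NAME_TO_ABBR = {"Alabama": "al", "Alaska": "ak"}
# Placeholder populations, not real figures.
POPULATION = {"al": 5_000_000, "ak": 700_000}


def to_prop(state_name, count):
    """Map a full state name to its abbreviation and a per-100k rate."""
    abbr = STATE_NAME_TO_ABBR[state_name]
    return abbr, deaths_per_100k(count, POPULATION[abbr])
```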

@jingjtang
Contributor

jingjtang commented Oct 27, 2020

Contacted @korlaxxalrok already for the automation. Will work with him once he starts on this.
As for the automation, some important points for this pipeline:

  • A token is needed. Either use mine or create another one for Delphi.
  • Code related to export_start_date should be changed according to the final strategy.
  • Data versioning stuff (run and store diffs daily but export to API weekly) might need @eujing 's help.

@RoniRos
Member

RoniRos commented Oct 27, 2020

@jingjtang :

  1. Is there significance to the use of 'and' vs. 'or' in the signal names above? Does pneumonia_and_covid_deaths_num count deaths that have been classified as due to BOTH pneumonia AND Covid, or those that have been classified as due to pneumonia + those classified as due to Covid?

  2. In this description, it's important to clearly mention whether the classification is based only on the primary ICD code (primary cause of death), or on 'any code'.

@jingjtang
Contributor

jingjtang commented Oct 27, 2020

@RoniRos

  1. If we treat deaths caused by influenza, pneumonia, and covid as different sets, 'and' means intersection while 'or' means union. So:

    • pneumonia_and_covid_deaths means deaths caused by BOTH pneumonia AND covid.
    • pneumonia_influenza_or_covid_19_deaths means deaths caused by influenza, pneumonia, OR covid.
  2. As described on their website, I think they are classified based on 'any code' (not sure whether it is the primary cause):
    COVID-19 deaths are identified using a new ICD–10 code. When COVID-19 is reported as a cause of death – or when it is listed as a “probable” or “presumed” cause — the death is coded as U07.1. This can include cases with or without laboratory confirmation.
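
The set reading of 'and' and 'or' can be illustrated with a toy example. The certificate IDs below are invented, and membership in a set is assumed to mean the condition appears anywhere on the death certificate.

```python
# Hypothetical death-certificate IDs that mention each condition.
covid = {1, 2, 3, 4}
pneumonia = {3, 4, 5}
influenza = {5, 6}

# 'and' is an intersection: certificates listing both conditions.
pneumonia_and_covid = pneumonia & covid

# 'or' is a union: certificates listing at least one of the conditions.
pneumonia_influenza_or_covid = pneumonia | influenza | covid
```

Note that the union counts each certificate once, so it is generally smaller than the sum of the three individual counts.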

@jingjtang
Contributor

@korlaxxalrok for automation

@benjaminysmith
Contributor

@RoniRos -- is the clarification on naming acceptable, or would you like to change the AND/OR use?

@krivard is there an approver for the correlations above, or is this enough?

@jingjtang
Contributor

API documentation PR here

@benjaminysmith
Contributor

The correlation looks pretty convincing to me. @RoniRos can give final approval on this, but otherwise this looks good to go.

If the names are also good I think we have everything we need aside from running it in automation.

@jingjtang
Contributor

Visual View for all of the signals here

@RoniRos
Member

RoniRos commented Nov 6, 2020

@benjaminysmith I'm very sorry this has been blocked on me. I needed to find time to review the NCHS data definitions.

@RoniRos
Member

RoniRos commented Nov 6, 2020

Based on the NCHS explanations and our conventions, the signals need to be named as follows:

  • deaths_covid_cumulative_num
  • deaths_covid_cumulative_prop
  • deaths_allcause_cumulative_num
  • deaths_allcause_cumulative_prop
  • deaths_flu_cumulative_num
  • deaths_flu_cumulative_prop
  • deaths_pneumonia_butnotflu_cumulative_num
  • deaths_pneumonia_butnotflu_cumulative_prop
  • deaths_covid_and_pneumonia_butnotflu_cumulative_num
  • deaths_covid_and_pneumonia_butnotflu_cumulative_prop
  • deaths_pneumonia_or_flu_or_covid_cumulative_num
  • deaths_pneumonia_or_flu_or_covid_cumulative_prop
  • deaths_percent_of_expected

By way of explanation: it turns out the classifications are not based on the primary cause of death, but on any cause of death, of which there are often several (several ICD codes per individual death). So the logic has to be spelled out more clearly.

Some choices can be debated. I'd love to hear your thoughts @krivard @benjaminysmith @jingjtang :

  • I started with 'deaths' because that's what we do in the existing Covid death signals (although we don't use 'covid' in their name, it is implied).
  • The example on the CDC website shows cumulative counts, so I made it 'cumulative'. If the signal is not cumulative, then 'cumulative' should be replaced by 'incidence' (or do you provide both signals?).
  • 'total' is too generic a term - there are many ways of deriving totals. The technical term is 'all cause', which we could spell as 'all_cause' or 'allcause' or even 'all-cause' (@krivard do we ever use a hyphen in signal names?).
  • 'flu' and 'influenza' are generally equivalent. In formal writing, 'influenza' is usually preferred, because 'flu' in general public usage is used for lots of other things (e.g. "stomach flu", which has nothing to do with influenza). However, these names are awfully long already, so I think it is acceptable and indeed preferable to use 'flu'.
  • "butnotflu" could be spelled "but_not_flu", if you prefer. Again, I was going for simplicity.

@RoniRos
Member

RoniRos commented Nov 6, 2020

> In this description, it's important to clearly mention whether the classification is based only on the primary ICD code (primary cause of death), or on 'any code'.

@jingjtang please modify the description to explicitly state that the classification is based on all the codes on the death certificate (not just the 'primary cause of death').

@RoniRos
Member

RoniRos commented Nov 6, 2020

> The correlation looks pretty convincing to me. @RoniRos can give final approval on this,

@benjaminysmith Can you please point me to the correlations?

@jingjtang
Contributor

jingjtang commented Nov 6, 2020

@RoniRos They should be incidence not cumulative. You can see a simple comparison between the NCHS covid deaths_num and usa-facts deaths_incidence_num here. Currently we only provide incidence, we can add cumulative ones if we want.

@RoniRos
Member

RoniRos commented Nov 6, 2020

> @RoniRos They should be incidence not cumulative. You can see a simple comparison between the NCHS covid deaths_num and usa-facts deaths_incidence_num here. Currently we only provide incidence, we can add cumulative ones if we want.

Ok, thanks! Then let's change to 'incidence' throughout.

Yes, I saw this plot, I thought Ben was referring to additional ones.
I'd love to visualize all these signals before we make them public. How can I do that?
We used to be able to call the COVIDcast map with the wip_ signals in a JSON string... I don't think that still works though.
I suppose if you can send me csv files for the most recent issue, that will do.

@jingjtang
Contributor

@RoniRos This is not automated yet and the signals cannot be fetched from our API. I am not sure what's the best way to visually show them. Here is a notebook that I created for you to play with.

@benjaminysmith
Contributor

@jingjtang -- what do we need to do to resolve this? Do you need additional input?

@RoniRos
Member

RoniRos commented Nov 14, 2020

Thanks @jingjtang .

  1. I wasn't concerned about reporting delays, nor about the explanation being inconsistent. I was referring to the observation that in your geo-wise correlation plot:
    image
    The correlation is appropriately high (almost 1.0) for most of the weeks, but not for the first few weeks (w11-w12 and maybe w13). To me this suggests that either the NCHS or the USAFacts signal has wrong data for these weeks, and I was asking which. I don't think that reporting delays in NCHS make a difference here, because by now there are no more updates to NCHS reports for w11-14. And NCHS data is based on the date of death, not the date of reporting. (@jingjtang : Does USAFacts report deaths by date of death or by the date the death was reported?)

On inspecting your comparison chart (covid_deaths_num.pdf), I think I figured out the reason. There are some surprising zeros in the NCHS counts. Specifically:

  • Alaska (AK) is all zeros, even though Table 2 here shows a cumulative 74 COVID deaths.
  • WY is zero thru w42 (!)
  • HI is zero thru w33
  • MT is zero thru w31
  • Quite a few states are zero thru w14 or w15, even though USAFacts reports cases

So @benjaminysmith I think my question about the glitchy correlation is resolved: the problem is NCHS, not USAFacts. But this is probably something we should try to resolve. For example, why does our NCHS signal report all zeros for AK when the CDC table shows cumulative 74 deaths? And do the prolonged zeros for HI, MT & others correspond to CDC's reporting?

@krivard
Contributor Author

krivard commented Nov 16, 2020

Is it likely that this is related to the file format problems with the early NCHS data files dated up to ~May? I inquired with Matthew but never got a response, and we eventually just went forward with the better file format that became available starting June. (Sparse) context in this thread.

@RoniRos
Member

RoniRos commented Nov 16, 2020

The weak correlations for w11-13 may be explained that way, but not the zeros for AK, WY, HI, MT. I think we need to resolve this before we publish -- we shouldn't publish zeros for AK, for example.
@jingjtang can you please investigate why e.g. AK is still showing zeros, even though the NCHS table shows cumulative 74 deaths? Please point me to the location in NCHS where the zeros originate.
@krivard by Matthew do you mean Matt Biggerstaff at CDC? I now have a more direct connection to NCHS, so I could try with them, but before I do I want to make sure I have all the facts right.

@jingjtang
Contributor

@RoniRos Just fixed a bug in the pipeline where I had incorrectly applied a one-week shift (#528). After fixing this issue:

So, based on the comparison with incidence numbers in USA-Facts, we should treat the NCHS weekly reports as incidence numbers. There are NAs in the raw dataset. I treated them as 0 when I initially wrote the pipeline, but that is incorrect, especially for states with many missing values, such as "AK", "DE", "HI", "ME", "NC", and "WY", either for a certain time range or for the entire published time range. I should definitely keep those NAs.
If we keep all of the NAs and drop regions where either signal (nchs_mortality deaths_covid_num, usa-facts deaths_incidence_num) has an NA when calculating the Spearman correlation, the result looks like this:
image
In W11, only WA has reported deaths in NCHS mortality (27), while in USA-Facts only WA (24 deaths) and LA (1 death) do.
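
The NA handling described above can be sketched as follows: blanks are parsed as missing rather than 0, and a region is dropped from the comparison when either signal is missing. The helper names and the empty-string encoding of a missing cell are assumptions for illustration; the WA and LA figures echo the W11 numbers quoted above, and the rest are invented.

```python
def parse_count(raw):
    """Parse a raw cell, keeping blanks as None instead of coercing them to 0."""
    if raw is None or raw.strip() == "":
        return None
    return int(raw)


def paired_complete_cases(a, b):
    """Return (regions, a_values, b_values) for regions present in both signals."""
    regions = [
        r for r in sorted(set(a) & set(b))
        if a[r] is not None and b[r] is not None
    ]
    return regions, [a[r] for r in regions], [b[r] for r in regions]


# "" stands for a missing cell in the raw file (an assumed encoding).
raw_nchs = {"AK": "", "WA": "27", "LA": "", "NY": "0"}
nchs = {state: parse_count(value) for state, value in raw_nchs.items()}
usafacts = {"AK": 0, "WA": 24, "LA": 1, "NY": 0}

regions, x, y = paired_complete_cases(nchs, usafacts)
```

Here AK and LA drop out of the comparison instead of entering it as zeros, while NY's genuine 0 is kept.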

It's correct that they have 74 cases in Table 2 for AK, but it is also correct that they only ever have 0 or NA for AK in Table 1 (something similar happens for WY, where the first non-zero/non-NA value shows up in W42, with week-ending date 2020-10-17). @RoniRos You can directly download the most recent raw dataset from here (click "export" and then "CSV"). So it seems they have 74 recorded COVID-related deaths through 11-14, but it is not clear which week those deaths should be assigned to.

@RoniRos
Member

RoniRos commented Nov 17, 2020

I think I figured out the reason for the discrepancy: any time the value is missing (N/A), it corresponds to a positive count less than 10 (i.e., somewhere in 1-9), which is censored by privacy rules. For low-population, low-covid states like Alaska or Wyoming, this happens most of the time. For some of the other states, it happens occasionally.
We need to figure out how to handle this. These counts are not completely missing -- they may be reconstructed, at least partially, from NCHS cumulative values (Table 2, which gave us 74 cases for AK). Have we been storing prior versions of Table 2? @brookslogan @jingjtang ?
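
As a rough illustration of the partial reconstruction idea, the sketch below bounds the censored weeks given a cumulative total, assuming every censored week hides a count from 1 to 9 and that the cumulative total covers exactly the same weeks as the weekly series. Both are assumptions, as are the function name and the numbers.

```python
def censored_bounds(weekly, cumulative_total, low=1, high=9):
    """Bound the censored weeks given a cumulative total.

    weekly: weekly counts, with None where the count was suppressed.
    Returns None when the total cannot be reconciled with the
    [low, high] assumption for suppressed weeks.
    """
    known = sum(v for v in weekly if v is not None)
    n_censored = sum(1 for v in weekly if v is None)
    remainder = cumulative_total - known  # deaths that must sit in censored weeks
    if n_censored == 0:
        return None if remainder != 0 else {"censored_sum": 0, "per_week": None}
    if not (low * n_censored <= remainder <= high * n_censored):
        return None
    others = n_censored - 1
    per_week = (
        max(low, remainder - high * others),  # the other weeks take their maximum
        min(high, remainder - low * others),  # the other weeks take their minimum
    )
    return {"censored_sum": remainder, "per_week": per_week}


# Invented series: two suppressed weeks, one published count, cumulative total 32.
example = censored_bounds([None, None, 15], cumulative_total=32)
```

With many censored weeks and a small total, the per-week bounds collapse back to the full 1 to 9 range, so this only helps when few weeks are suppressed.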

@jingjtang
Contributor

I didn't find a download link for Table 2, but here is a county-level table for COVID-related deaths only. According to their API docs, we should be able to specify data_as_of; however, I cannot fetch old versions of the data. (I changed data_as_of in https://data.cdc.gov/resource/kn79-hsxy.json?data_as_of=2020-11-12T00:00:00.000, but got nothing.) @brookslogan Have you played with the Socrata API to get old versions of data?

@jingjtang
Contributor

jingjtang commented Nov 17, 2020

@benjaminysmith Draft for the signal description pop-up text added under Name: Percent of Expected Deaths. Since this is our first weekly signal, it will affect some of our map settings.

@krivard
Contributor Author

krivard commented Nov 17, 2020

when I looked at Socrata before, data_as_of did nothing. Here's a Slack thread where Logan explains that it's not a first-class feature of their API.

I'd forgotten about the frontend support for weeklies 😱 we may end up pushing this to 1.12 in the map, but at least we can release in the API with the current batch.

@benjaminysmith
Contributor

Summarizing the last few comments:

  • We are not currently able to reconstruct the data because of the API limitations. It sounds like the default way for us to handle these for now is to fill these with N/A. @jingjtang is that an accurate assessment? If that is the simplest and we still think the data is useful, we could start with this and follow up to see if we can reconstruct later.
  • We will need to follow up with map support for weekly data. @tildechris as FYI.

@capnrefsmmat
Contributor

Will this signal have time_type = "week" in the API? If so, we'll need to update the API clients to support it as well, and I'll need to file an issue for that.
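
For client authors thinking about weekly time values, here is a small sketch that converts a date to a week number. It assumes the weeks follow the CDC MMWR convention (Sunday through Saturday, week 1 being the first week with at least four days in the year), which matches the W42 / 2020-10-17 pairing mentioned earlier in the thread. The function names and the YYYYWW key are assumptions, not part of any existing client.

```python
from datetime import date, timedelta


def _week_start(d):
    """Sunday on or before d (MMWR weeks run Sunday through Saturday)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


def mmwr_week(d):
    """Return (year, week) for the MMWR week containing date d."""
    start = _week_start(d)
    # A week belongs to the year that holds at least four of its days,
    # which is the year of its Wednesday.
    year = (start + timedelta(days=3)).year
    # Week 1 is always the week containing January 4.
    first_start = _week_start(date(year, 1, 4))
    return year, (start - first_start).days // 7 + 1


def epiweek_key(d):
    """Hypothetical integer key of the form YYYYWW."""
    year, week = mmwr_week(d)
    return year * 100 + week
```

For example, `mmwr_week(date(2020, 10, 17))` gives `(2020, 42)`.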

@jingjtang
Contributor

@benjaminysmith Yes. Since we currently don't have another data source that can fill in the values missing due to privacy rules, we'll just keep them as NaN, at least for now, I think. @krivard
And a PR here for this.

@tildechris

Regarding map support, we'd need to know an ETA of when this would be available in staging to schedule the work to pull it in.

@benjaminysmith
Contributor

API support sounds like a blocker for release. Filed a request for this in cmu-delphi/covidcast#305

@krivard
Contributor Author

krivard commented Nov 19, 2020

*covidcast client support.

The API server has no problem with a weekly time type, and neither do the low-level clients in delphi-epidata. This may be a topic for team leads this afternoon.

@RoniRos
Member

RoniRos commented Nov 19, 2020

Representing the censored values as N/A seems like the most reasonable thing to do for now. Eventually, we may want to distinguish them from truly missing. Do our tools support different types of missingness?

@RoniRos
Member

RoniRos commented Nov 19, 2020

Also, we need to think of how to represent on the map zero vs. censored vs. unknown.
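
One possible shape for this, purely as a sketch: tag each observation with a missingness status so that zero, censored, and unknown stay distinguishable downstream. The enum values, the `suppressed` flag, and the map labels are all hypothetical; nothing here reflects an existing encoding in our tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Missingness(Enum):
    NOT_MISSING = "not_missing"
    CENSORED = "censored"  # suppressed for privacy (a count from 1 to 9)
    UNKNOWN = "unknown"    # no information at all


@dataclass(frozen=True)
class Observation:
    value: Optional[int]
    status: Missingness


def classify(raw: Optional[int], suppressed: bool = False) -> Observation:
    """Attach a missingness status to a raw value."""
    if raw is not None:
        return Observation(raw, Missingness.NOT_MISSING)
    status = Missingness.CENSORED if suppressed else Missingness.UNKNOWN
    return Observation(None, status)


def map_label(obs: Observation) -> str:
    """Text a map tooltip could show for each case."""
    if obs.status is Missingness.CENSORED:
        return "1-9 (suppressed)"
    if obs.status is Missingness.UNKNOWN:
        return "no data"
    return str(obs.value)
```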

@krivard
Contributor Author

krivard commented Nov 19, 2020

Verdict from leads: Go ahead with the release in the API; the client and map will catch up.

@RoniRos We do not yet encode different kinds of missingness, it's on the list for December.

@SumitDELPHI SumitDELPHI added the Engineering Used to filter issues when synching with Asana label Dec 6, 2020
@korlaxxalrok
Contributor

@krivard This is on prod now. It should have delivered some data yesterday, but otherwise will run daily, build up its cache, and then output on Mondays.
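
The daily-run, Monday-export schedule can be pictured with a sketch like the one below. It is not the production automation; the state layout, function names, and row keys are invented for illustration.

```python
from datetime import date

_ABSENT = object()  # sentinel so that a stored None still compares as "present"


def changed_rows(previous, current):
    """Rows in current that are new or differ from previous."""
    return {
        key: value
        for key, value in current.items()
        if previous.get(key, _ABSENT) != value
    }


def daily_run(today, snapshot, state):
    """Record today's diff; on Mondays, return rows changed since the last export."""
    state["daily_diffs"][today.isoformat()] = changed_rows(state["last_snapshot"], snapshot)
    state["last_snapshot"] = dict(snapshot)
    if today.weekday() != 0:  # Monday is 0
        return None
    export = changed_rows(state["last_exported"], snapshot)
    state["last_exported"] = dict(snapshot)
    return export


state = {"last_snapshot": {}, "daily_diffs": {}, "last_exported": {}}
sunday = daily_run(date(2020, 12, 6), {("ak", 202048): None, ("wa", 202048): 50}, state)
monday = daily_run(date(2020, 12, 7), {("ak", 202048): None, ("wa", 202048): 52}, state)
```

In this toy run the Sunday call returns nothing to export, and the Monday call returns both rows because nothing had been exported before.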

@krivard
Contributor Author

krivard commented Dec 8, 2020

The API docs got lost along the way but I recovered them in cmu-delphi/delphi-epidata#315.

@jingjtang would you draft a mailing list announcement and drop it in a comment here? Then @Akvannortwick can polish it up for distribution and maybe we can announce this puppy tomorrow.

@Akvannortwick

You can also draft an email here and I will make changes and suggestions.

@jingjtang
Contributor

@krivard @Akvannortwick Other than the descriptions for all related signals, is there anything else to be added?

Besides, I made some comments under cmu-delphi/delphi-epidata#315

@krivard
Contributor Author

krivard commented Dec 10, 2020

Released today!

9 participants