
covidcast has issue "update" data that re-reports the previous version of observations #1362

Closed
brookslogan opened this issue Nov 10, 2021 · 2 comments
Labels: data quality (Missing data, weird data, broken data)


brookslogan commented Nov 10, 2021

Actual Behavior:

The covidcast issue data includes rows where all entries besides issue and lag appear to be the same as in the preceding version of that row.

Reprex with commentary:

[2022-02-14: Note that this snippet's original code is no longer supported: as_of + issues cannot be used together. I've updated it with something that hopefully is equivalent, but gives 10566 update rows rather than 10465, perhaps due to other version data patches?]

library("tibble")
library("dplyr")

analysis.date = as.Date("2021-11-09")

geo.values = c("ak", "al")

## jhu.case.updates.original.code =
##  delphi.epidata::covidcast("jhu-csse", "confirmed_incidence_num",
##                            "day", "state",
##                            delphi.epidata::epirange(12340101,34560101), geo.values,
##                            as_of = format(analysis.date-1L, "%Y%m%d"), # try to make this very reproducible
##                            issues = delphi.epidata::epirange(12340101,34560101)) %>%
##  delphi.epidata::fetch_tbl()
jhu.case.updates =
  epidatr::covidcast("jhu-csse", "confirmed_incidence_num",
                     "state", "day",
                     geo.values, epidatr::epirange(12340101, 34560101),
                     issues = epidatr::epirange(12340101, format(analysis.date-1L, "%Y%m%d"))) %>%
  epidatr::fetch_tbl()


## See if we have any issue data that repeats the same `value` for an
## observation as the preceding issue (or issues) by performing an RLE of the
## value across issues for each observation (geo_value x time_value)
jhu.case.updates %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  summarize(value.rle.tbl = {
    value.rle = rle(value)
    tibble(length = value.rle[["lengths"]], value = value.rle[["values"]])
  }, .groups="drop") %>%
  mutate(
    value.run.length = value.rle.tbl[["length"]],
    value.run.value = value.rle.tbl[["value"]],
    value.rle.tbl = NULL
  ) %>%
  arrange(-value.run.length) %>%
  print()

## We see that there are updates that don't update the value, but maybe
## something else (besides `issue`&`lag`) could have been updated. Let's check
## on the other columns.
## - Examine the observation with the longest run in more detail.
jhu.case.updates %>%
  filter(geo_value=="ak", time_value==as.Date("2020-04-23")) %>%
  arrange(issue) %>%
  select(issue, lag, value, missing_value, stderr, missing_stderr, sample_size, missing_sample_size) %>%
  count(across(-c(issue, lag))) %>%
  print(n=100L)
##   There are two versions of the observation factoring in these additional
##   columns. However, since one of these only appears 1x and the other 56x,
##   there must be at least 54 instances of re-reporting the same row (with
##   different `issue`&`lag`)

## Still, to avoid the complications above regarding the other columns and the
## row ordering, let's try to directly detect re-reporting of "entire
## observations":
rereporting =
  jhu.case.updates %>%
  select(-lag) %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group=TRUE) %>%
  ## "lag the row" by `lead`ing the `issue`
  mutate(issue = lead(issue)) %>% filter(!is.na(issue)) %>% ungroup() %>%
  ## find the re-reporting:
  inner_join(., jhu.case.updates %>% select(-lag), by=names(.))

rereporting %>%
  nrow()
## 8212 re-reported rows in this extract

jhu.case.updates %>%
  nrow()
## 10465 total rows in this extract

## So, at least for ak&al cases, the amount of re-reporting appears substantial.
## If duplicate rows continue to be steadily re-reported, we would expect "very"
## quadratic growth in the size of the data set (rough arithmetic sketch below);
## with sparser re-reporting and sparse revisions, we would expect "linear-ish"
## growth.
rereporting %>%
  count(issue) %>%
  arrange(issue) %>%
  print(n=100L)
## The last issue re-reporting an observation appears to be 2020-10-30 for these
## states, so it looks like there isn't the "very" quadratic growth, at least
## for the current signal, geo_type, and geo_values.
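
## Rough arithmetic sketch of the growth-rate claim above, using made-up numbers
## (the "~3 revised rows per issue" figure is purely an assumption for
## illustration): if every daily issue re-reports every prior observation for a
## single geo_value, the archive holds about 1 + 2 + ... + N = N*(N+1)/2 rows
## after N issues, whereas diff-style issues give roughly linear growth.
n.issues = c(30, 100, 365)
tibble(
  n.issues = n.issues,
  rows.if.everything.rereported = n.issues * (n.issues + 1) / 2,
  rows.if.only.diffs = n.issues * (1 + 3) # new day + ~3 assumed revised rows per issue
)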

Expected behavior

I soft-expected an issue-query to return only data with changes to entries (other than to issue itself and lag), for two reasons:

  • ?covidcast::covidcast_signal describes issue-queries as "[f]etch[ing] only data that was published or updated ("issued") on these [issues]", which might leave this impression.
  • Storage efficiency-wise, this would make sense; query efficiency-wise, I'm not sure, but I'd guess it would probably help, especially if there is an index over (source, signal, geo_value, time_value, issue).

Context

I am working on cmu-delphi/epiprocess#23 to build some utilities for working with data version history from delphi-epidata or elsewhere. A natural example was to use version history from something in delphi-epidata using an issue-query, which unearthed this surprise.

The re-reporting isn't really a problem for me right now. The utilities I am working on should be built to accept "update" or snapshot (as-of-query) data that contains duplicate rows, and eventually to compact the data to remove these duplicates. Removing re-reporting from covidcast issue data wouldn't ensure that users never feed in such re-reported data from other data providers.
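
A minimal sketch of what such compaction could look like on the jhu.case.updates extract above (purely an illustration of the idea, not how the epiprocess utilities are or will be implemented; it checks only the columns that appear in the reprex):

library("dplyr")

## Within each observation (geo_value x time_value), keep an issue's row only if
## at least one non-version column differs from the immediately preceding
## issue's row; other columns (e.g. the missing_* ones) could be added the same way.
same_as_prev = function(x) {
  prev = dplyr::lag(x)
  # treat NA == NA as "same"; an NA vs non-NA pair counts as a change
  coalesce(x == prev, FALSE) | (is.na(x) & is.na(prev))
}

compacted =
  jhu.case.updates %>%
  group_by(geo_value, time_value) %>%
  arrange(issue, .by_group = TRUE) %>%
  filter(row_number() == 1L |   # always keep the first report of an observation
           !(same_as_prev(value) &
               same_as_prev(stderr) &
               same_as_prev(sample_size))) %>%
  ungroup()

nrow(compacted) # should be far below nrow(jhu.case.updates) given the re-reporting above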

brookslogan added the data quality (Missing data, weird data, broken data) label on Nov 10, 2021
krivard (Contributor) commented Nov 10, 2021

It sounds like what is happening here is that some issues are diff-type (containing only new or changed values) but a surprisingly large number of issues are confirmation-type (containing all values for all dates, regions, and signals, even if they haven't changed).

The large JHU patch applied last week contained one confirmation-type issue (20210401) and 203 diff-type issues (20210401 to 20211021), so it's unlikely that the recent repairs caused (the bulk of) this problem.

Confirmation-type issues are expected under certain circumstances:

  1. Certain sources always use confirmation-type issues (fb-survey, doctor-visits, hospital-admissions)
  2. All issues for all sources prior to 2020-07-16 were confirmation-type, since that's how we generated the initial set of versions when versioning was launched. Indicators were switched to diff-based issues one at a time afterward (with poor bookkeeping as to the timing of each switch)
  3. With the exception of the bigint patches, all data patches applied prior to November 2021 were confirmation-type, since setting up the machinery for a diff-type patch is fairly complicated and we only recently established software to assist with the process
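
To make the diff-type vs confirmation-type distinction concrete, here is a toy sketch with invented numbers (not actual covidcast data): the same short revision history published under each style.

library("tibble")

## Invented history for one geo_value: time_value 2020-04-01 is first reported
## as 5 and later revised to 7; time_value 2020-04-02 is reported as 3 and
## never revised.

## Diff-type: each issue carries only new or changed rows.
diff.type = tribble(
  ~issue,       ~time_value,  ~value,
  "2020-04-02", "2020-04-01", 5,  # first report of 04-01
  "2020-04-03", "2020-04-02", 3,  # new day only; 04-01 unchanged, so not re-sent
  "2020-04-04", "2020-04-01", 7   # revision of 04-01 only
)

## Confirmation-type: each issue carries every row, changed or not.
confirmation.type = tribble(
  ~issue,       ~time_value,  ~value,
  "2020-04-02", "2020-04-01", 5,
  "2020-04-03", "2020-04-01", 5,  # unchanged, re-reported anyway
  "2020-04-03", "2020-04-02", 3,
  "2020-04-04", "2020-04-01", 7,
  "2020-04-04", "2020-04-02", 3   # unchanged, re-reported anyway
)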

Your particular example (confirmed_incidence_num, state:ak, day:20200423) falls primarily under case (2) above: 52 of the 57 confirmation issues occurred daily up to 2020-08-14; 4 more occurred during fall 2020 and were likely repairs; the final one was from last week's patch.

MariaDB [epidata]> select issue, value, from_unixtime(value_updated_timestamp) from covidcast where source="jhu-csse" and `signal`="confirmed_incidence_num" and time_type="day" and time_value=20200423 and geo_type="state" and geo_value="ak";
+----------+-------+----------------------------------------+
| issue    | value | from_unixtime(value_updated_timestamp) |
+----------+-------+----------------------------------------+
| 20200424 |     2 | 2021-05-05 14:57:09                    |
| 20200507 |     2 | 2020-05-02 16:00:58                    |
| 20200514 |     2 | 2020-05-13 13:13:41                    |
| 20200621 |     2 | 2020-06-20 10:43:21                    |
| 20200625 |     2 | 2020-06-25 04:05:07                    |
| 20200626 |     2 | 2020-06-26 04:06:55                    |
| 20200627 |     2 | 2020-06-27 04:06:53                    |
| 20200628 |     2 | 2020-06-28 04:07:16                    |
| 20200629 |     2 | 2020-06-29 04:07:12                    |
| 20200702 |     2 | 2020-07-01 11:07:38                    |
| 20200703 |     2 | 2020-07-02 11:07:48                    |
| 20200704 |     2 | 2020-07-03 11:07:32                    |
| 20200705 |     2 | 2020-07-04 11:07:47                    |
| 20200706 |     2 | 2020-07-05 11:08:11                    |
| 20200707 |     2 | 2020-07-06 11:07:54                    |
| 20200708 |     2 | 2020-07-07 11:08:26                    |
| 20200709 |     2 | 2020-07-08 11:08:18                    |
| 20200710 |     2 | 2020-07-09 10:49:05                    |
| 20200711 |     2 | 2020-07-10 11:08:46                    |
| 20200712 |     2 | 2020-07-11 11:08:41                    |
| 20200713 |     2 | 2020-07-12 11:09:00                    |
| 20200714 |     2 | 2020-07-13 11:08:48                    |
| 20200715 |     2 | 2020-07-14 11:09:08                    |
| 20200716 |     2 | 2020-07-16 11:10:07                    |
| 20200717 |     2 | 2020-07-17 11:16:20                    |
| 20200718 |     2 | 2020-07-18 11:15:13                    |
| 20200719 |     2 | 2020-07-19 11:15:58                    |
| 20200720 |     2 | 2020-07-20 11:16:06                    |
| 20200721 |     2 | 2020-07-21 11:17:32                    |
| 20200722 |     2 | 2020-07-22 11:16:56                    |
| 20200723 |     2 | 2020-07-23 11:16:42                    |
| 20200724 |     2 | 2020-07-24 11:08:02                    |
| 20200725 |     2 | 2020-07-25 11:09:17                    |
| 20200726 |     2 | 2020-07-26 11:09:55                    |
| 20200727 |     2 | 2020-07-27 11:10:34                    |
| 20200728 |     2 | 2020-07-28 11:11:27                    |
| 20200729 |     2 | 2020-07-29 11:13:28                    |
| 20200730 |     2 | 2020-07-30 11:13:29                    |
| 20200731 |     2 | 2020-07-31 11:12:49                    |
| 20200802 |     2 | 2020-08-02 11:12:37                    |
| 20200803 |     2 | 2020-08-03 10:44:59                    |
| 20200804 |     2 | 2020-08-04 11:12:50                    |
| 20200805 |     2 | 2020-08-05 11:13:24                    |
| 20200806 |     2 | 2020-08-06 13:23:58                    |
| 20200807 |     2 | 2020-08-07 11:12:55                    |
| 20200808 |     2 | 2020-08-08 11:14:41                    |
| 20200809 |     2 | 2020-08-09 11:14:28                    |
| 20200810 |     2 | 2020-08-10 11:15:16                    |
| 20200811 |     2 | 2020-08-11 11:15:25                    |
| 20200812 |     2 | 2020-08-12 11:15:47                    |
| 20200813 |     2 | 2020-08-13 11:14:54                    |
| 20200814 |     2 | 2020-08-14 11:00:54                    |
| 20200825 |     2 | 2020-08-25 11:48:56                    |
| 20200828 |     2 | 2020-08-28 11:07:18                    |
| 20201008 |     2 | 2020-10-08 11:03:05                    |
| 20201014 |     2 | 2020-10-14 21:11:38                    |
| 20210401 |     2 | 2021-10-22 17:44:36                    |
+----------+-------+----------------------------------------+
57 rows in set (0.04 sec)

If you suspect this example is a fluke and that what you are seeing is a new behavior, let me know; otherwise I think this is a "wontfix" -- covidcast is designed for diff-based issues but doesn't guarantee them. I'll make a note for us to clarify this in the documentation.

brookslogan (Author) commented Nov 10, 2021

I re-ran on all geo values (still just for this case signal). The rereporting counts are now:

   issue          n
   <date>     <int>
 1 2020-05-14  5522
 2 2020-05-23    51
 3 2020-05-26    51
 4 2020-06-21  4318
 5 2020-06-25  6391
 6 2020-06-26  6499
 7 2020-06-27  6604
 8 2020-06-28  6656
 9 2020-06-29  6708
10 2020-07-02  6811
11 2020-07-03  6782
12 2020-07-04  6911
13 2020-07-05  6966
14 2020-07-06  7020
15 2020-07-07  7072
16 2020-07-08  7121
17 2020-07-09  7176
18 2020-07-10  7228
19 2020-07-11  7280
20 2020-07-12  7332
21 2020-07-13  7384
22 2020-07-14  7436
23 2020-07-15  7488
24 2020-07-16  7530
25 2020-07-17     2
26 2020-07-18  7696
27 2020-07-19  7599
28 2020-07-20  7650
29 2020-07-21  7701
30 2020-07-22  7751
31 2020-07-23  7803
32 2020-07-24  7853
33 2020-07-25  7903
34 2020-07-26  7956
35 2020-07-27  8005
36 2020-07-28  8058
37 2020-07-29  8108
38 2020-07-30  8159
39 2020-07-31  8211
40 2020-08-02  8307
41 2020-08-03  8364
42 2020-08-04  8412
43 2020-08-05  8466
44 2020-08-06  8517
45 2020-08-07  8568
46 2020-08-08  8619
47 2020-08-09  8670
48 2020-08-10  8721
49 2020-08-11  8772
50 2020-08-12  8823
51 2020-08-13  8874
52 2020-08-14  5905
53 2020-08-24   251
54 2020-08-25  9485
55 2020-08-26    51
56 2020-08-28  6705
57 2020-10-08  7970
58 2020-10-14 12213
59 2020-10-15    96
60 2020-10-29   560
61 2020-10-30   982
62 2021-01-08    12

So it doesn't look like a fluke; this problem has been addressed for this signal[-geotype-timetype] by the diff-based system. (The old rereporting still accounts for ~82% of its rows, though.)

wontfix + better documentation makes sense!
