covidcast has issue "update" data that re-reports the previous version of observations #1362
Comments
It sounds like what is happening here is that some issues are diff-type (containing only new or changed values), but a surprisingly large number of issues are confirmation-type (containing all values for all dates, regions, and signals, even if they haven't changed). The large JHU patch applied last week contained one confirmation-type issue (20210401) and 203 diff-type issues (20210401 to 20211021), so it's unlikely that the recent repairs caused (the bulk of) this problem. Confirmation-type issues are expected under certain circumstances:
Your particular example (
If you suspect this example is a fluke, and that what you are seeing is a new behavior, let me know; otherwise I think this is a "wontfix": covidcast is designed for diff-based issues but doesn't guarantee them. I'll make a note for us to clarify this in the documentation.
I re-ran on all geo values (still just for this case signal). The rereporting counts are now:
So it doesn't look like a fluke; this problem has been addressed for this signal[-geotype-timetype] by the diff-based system. (The old rereporting still accounts for ~82% of its rows, though.) wontfix + better documentation makes sense!
Actual Behavior:
The covidcast issue data includes rows where all entries besides issue and lag appear to be the same as in the preceding version of that row.
Reprex with commentary:
[2022-02-14: Note that this snippet's original code is no longer supported: as_of + issues cannot be used together. I've updated it with something that hopefully is equivalent, but gives 10566 update rows rather than 10465, perhaps due to other version data patches?]
Expected behavior
I soft-expected an issue-query to return only data with changes to entries (other than to issue itself and lag), for two reasons:
?covidcast::covidcast_signal describes issue-queries as "[f]etch[ing] only data that was published or updated ("issued") on these [issues]", which might leave this impression.
source,signal,geo_value,time_value,issue .
.Context
I am working on cmu-delphi/epiprocess#23 to build some utilities for working with data version history from delphi-epidata or elsewhere. A natural example was to use version history from something in delphi-epidata using an issue-query, which unearthed this surprise.
The re-reporting isn't really a problem for me right now. The utilities I am working on should be built to accept "update" or snapshot (as-of-query) data that contains duplicate data, and eventually to compact the data to remove these duplicates. Removing re-reporting in covidcast issue data wouldn't ensure that users wouldn't input such re-reported data from other data providers.
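The compaction step mentioned above could look roughly like the following. This is only a minimal sketch in Python using the standard library, not the epiprocess implementation: the function name compact_updates, the value column, and the example rows are all hypothetical, and it assumes each row is a dict keyed by source, signal, geo_value, time_value and issue. A row is dropped when everything other than issue and lag matches the immediately preceding version of the same observation.

```python
from itertools import groupby
from operator import itemgetter

# Columns identifying one observation across versions (names as used in this thread).
KEY = ("source", "signal", "geo_value", "time_value")
# Columns that differ between versions by construction, so they are ignored when comparing.
VERSION_ONLY = ("issue", "lag")


def compact_updates(rows):
    """Drop rows that only re-report the preceding version of the same observation.

    Hypothetical helper: `rows` is an iterable of dicts containing the KEY
    columns, an `issue` column, and any number of payload columns.
    """
    key_of = itemgetter(*KEY)
    # Sort by observation key, then by issue, so versions are compared in order.
    ordered = sorted(rows, key=lambda r: (key_of(r), r["issue"]))
    kept = []
    for _, versions in groupby(ordered, key=key_of):
        previous = None
        for row in versions:
            payload = {
                k: v for k, v in row.items()
                if k not in KEY and k not in VERSION_ONLY
            }
            # Keep the first version and every version whose payload changed.
            if payload != previous:
                kept.append(row)
            previous = payload
    return kept


# Made-up example rows (not real covidcast data): the second issue re-reports
# the first, and the third issue carries an actual change.
base = {"source": "jhu-csse", "signal": "example_signal",
        "geo_value": "pa", "time_value": 20210401}
example_rows = [
    {**base, "issue": 20210402, "lag": 1, "value": 10.0},
    {**base, "issue": 20210403, "lag": 2, "value": 10.0},
    {**base, "issue": 20210404, "lag": 3, "value": 12.0},
]
compacted = compact_updates(example_rows)
print([r["issue"] for r in compacted])  # prints [20210402, 20210404]
```

Comparing against the immediately preceding version (rather than against any earlier version) means a value that changes and later reverts is still recorded as two updates, which preserves the version history.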