Downloading CSV Classifications file via API Issues #2297
Related discussion about just yanking those dropdown tasks from the API entirely: zooniverse/Panoptes-Front-End#3636
The data that provides the keys for the annotation translation is found in the workflow's tasks. It's that transformation that's the bottleneck. There's probably some potential optimization in our formatters, but there shouldn't be a need to store that kind/amount of information in the workflow in the first place. @camallen's idea of storing large dropdown task data in S3 would be a pretty good start. This would require work on both front and back ends, so we'll need to figure out how to prioritize it.
@rbruhn The code I linked to above is how we're performing the translation in our code. You are familiar with pulling classifications directly from the API, and the data contained in the workflow's tasks looks like this:
This continues that way for 33 more questions, including every country on Earth, every state and county in the US (a couple of times), plus Canada and Mexico, several versions of the list of years between 1800 and the present, and a list of the names of 20,000 people by last name (A-H). I was told that you may be able to enlist the aid of @juliema and @rafelafrance, whom I'm pinging here to let them know what's up. Myself, @camallen and @marten can help answer questions. Workflows like this example (2679) should be cleaned up and unused tasks deleted. While that wouldn't help Panoptes process the classifications that have already been made (because of how our workflow versioning system ensures correctly translated annotations), it would certainly help in the future.
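To make the translation step concrete, here is a minimal sketch of mapping a stored dropdown answer back to its human-readable label using the task data from the workflow. The `selects`/`options` structure below is illustrative only, not the exact production schema:

```python
import json


def translate_dropdown(annotation_value, task):
    """Map a stored dropdown option value to its display label.

    Assumes a dropdown-style task dict with a `selects` list, where each
    select holds an `options` mapping of lists of {"value", "label"}
    dicts. This shape is a hypothetical stand-in for the real one.
    """
    for select in task.get("selects", []):
        for option_list in select.get("options", {}).values():
            for option in option_list:
                if option["value"] == annotation_value:
                    return option["label"]
    return annotation_value  # fall back to the raw stored value


# A tiny hypothetical task definition with a single option.
task = {
    "selects": [
        {"options": {"*": [{"value": "us-fl", "label": "Florida"}]}}
    ]
}
print(translate_dropdown("us-fl", task))  # Florida
```

With 33 questions and tens of thousands of options, walking structures like this for every classification is exactly the per-row cost being described.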
If I understand things correctly, you need to walk the JSON tree to get the data out of it. I'm only extracting data from the classifications for now, but the tree should be very similar to the one in the workflow. On the plus side, the CSV file contains all of the JSON in a single cell, so what I'm doing is pretty much what you need to do. This code can easily be adapted to work with JSON data directly. The outer call is here, and this is where the inner logic resides.
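As a sketch of that approach (the column layout here is a trimmed-down stand-in for the real classifications export), parsing the JSON tree held in a single CSV cell might look like:

```python
import csv
import io
import json

# A one-row stand-in for the classifications export; the real file has
# more columns, but the whole annotation tree sits in one cell.
sample = io.StringIO(
    'classification_id,annotations\n'
    '123,"[{""task"": ""T0"", ""value"": ""us-fl""}]"\n'
)

for row in csv.DictReader(sample):
    annotations = json.loads(row["annotations"])  # one cell -> full JSON tree
    for ann in annotations:
        print(ann["task"], "->", ann["value"])  # walk the tree node by node
```

The same inner loop works unchanged on annotations fetched straight from the API, since `json.loads` on the CSV cell yields the same kind of list of task/value dicts.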
I wanted to add that you could still download the workflows export and the subjects export, and use those to look up workflow/subject metadata/translations rather than querying our API for it. Those two probably change much less frequently than the classifications export.
I spoke to Rafe and Julie regarding this and their reconciliation script. Perhaps I wasn't clear on what I was trying to achieve, given the issues with NFN server load. In order to avoid causing further problems with CSV creation and export every night, I wanted to recreate the CSV file on my own locally, then run their reconciliation script. So, having spoken with them, the solution for me would be to create the workflow classification report locally.
I'm assuming the classification ids are in order, so I could simply append new data to the existing CSV files instead of having to recreate them from the start.
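That incremental-append idea could be sketched as follows, assuming (as the comment above does) that classification ids increase monotonically. The file layout and field names are hypothetical:

```python
import csv


def append_new_rows(existing_path, new_rows):
    """Append only classifications newer than the last id already on disk.

    Assumes classification ids are monotonically increasing; the
    two-column layout is illustrative, not the real export schema.
    """
    # Find the highest classification id we have already stored.
    with open(existing_path, newline="") as f:
        last_id = max(
            (int(r["classification_id"]) for r in csv.DictReader(f)),
            default=0,
        )
    # Keep only rows newer than what is on disk, then append them.
    fresh = [r for r in new_rows if int(r["classification_id"]) > last_id]
    with open(existing_path, "a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["classification_id", "annotations"]
        )
        writer.writerows(fresh)
    return len(fresh)
```

A nightly job would fetch recent classifications, pass them through a function like this, and never rebuild the full file from scratch.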
Ok... I was wrong in that last post. Or rather, my first post is right, and each annotation needs to be run through, as Zach suggested, with the workflow keys.
Yeah, and like @marten said, the workflow export button in the lab will get you all the tasks for every workflow all at once. You'll want to re-export if any of them change, but since the ones you care about are live, that should happen rarely (if ever).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
As per Zach, I'm opening this to track how the following issue can be resolved:
Zach - The transformation of the annotation from the classification straight from the API to the column in the CSV is exactly what is causing the issue on our side. The huge amount of data stored in the NfN dropdown tasks has run up against our ability to decode it quickly...
If that transformation is necessary, there's no better way to perform these exports than what is available via the API. Please do open a GitHub issue with all the details you have; it will be easier to follow, and we can get some more eyes on it that way.
At Biospex, we are trying to download the Workflow Classification exports on a nightly basis via the API, sending a create request and then retrieving the file when it's completed. On several occasions, these files have not been created, for the reasons given by Zach above.
We then take these files and run them through the Label Reconciliation scripts created by Julie:
https://github.com/juliema/label_reconciliations
This gives us the County/State we need for displaying our heat maps, as well as the counts for our statistics on each Expedition/Workflow.
Zach suggests getting the task information via the API. However, it comes in such a form that Julie's Label Reconciliation scripts cannot run correctly, as those scripts were written to handle the data as it appears in the CSV file. The Classification API also does not include the custom subject information we need for processing (Biospex expedition id, image id, etc.).
Below are the columns contained in the CSV files and their origins, as far as I can tell. I could not find all of the matching data.
classification_id - API -> Classifications
user_name - API -> User call
user_id - API -> Classifications
user_ip - ???
workflow_id - Several API points but not needed since we have it
workflow_name - API -> Workflow
workflow_version - API -> Workflow
created_at - API -> Classifications
gold_standard - ??? Apiary states this comes with Classification but I don't see it in response
expert - ??? Apiary states this comes with Classification but I don't see it in response
metadata - API -> Classifications
annotations - ??? See Below**
subject_data - ??? See Below**
subject_ids - API -> Classifications
** There are issues with the JSON represented in the CSV file versus what is returned by the API calls. While some information matches, it seems there is some processing going on to alter the representation before the data is put into these two columns. Should they not be similar?
A solution would be to enable the collection of this data via various calls to the API, with the JSON represented in a way similar to the CSV as far as annotations and subject_data are concerned, so that the consumer can store and build their own CSV files locally instead of relying on the NFN servers. Julie's Label Reconciliation scripts could then be run against those local CSV files. It would perhaps reduce the load caused by requesting complete CSV downloads every night.
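To make that proposal concrete, here is a minimal sketch of assembling an export-style row from separately fetched classification, workflow, and subject records. No real API calls are made; the field names follow the column list above, but the record shapes are assumptions rather than the actual Panoptes schema:

```python
import json


def build_row(classification, workflow, subjects):
    """Combine separately fetched API records into one export-style row.

    Field names mirror the CSV columns listed above; the record shapes
    are illustrative, not the exact Panoptes API response format.
    """
    return {
        "classification_id": classification["id"],
        "user_id": classification.get("user_id", ""),
        "workflow_id": workflow["id"],
        "workflow_name": workflow["display_name"],
        "workflow_version": classification["workflow_version"],
        "created_at": classification["created_at"],
        "metadata": json.dumps(classification["metadata"]),
        "annotations": json.dumps(classification["annotations"]),
        "subject_data": json.dumps(
            {s["id"]: s["metadata"] for s in subjects}
        ),
        "subject_ids": ";".join(s["id"] for s in subjects),
    }


# Hypothetical records, shaped only loosely like real API responses.
classification = {
    "id": "9000",
    "user_id": "7",
    "workflow_version": "5.12",
    "created_at": "2017-01-01T00:00:00Z",
    "metadata": {"session": "abc"},
    "annotations": [{"task": "T0", "value": "us-fl"}],
}
workflow = {"id": "2679", "display_name": "Example workflow"}
subjects = [{"id": "42", "metadata": {"expedition_id": "1"}}]

row = build_row(classification, workflow, subjects)
print(row["subject_ids"])  # 42
```

Rows built this way could be written out with `csv.DictWriter` into a local file that matches the export layout the reconciliation scripts expect.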