
Downloading CSV Classifications file via API Issues #2297

Closed
rbruhn opened this issue Apr 10, 2017 · 9 comments

rbruhn commented Apr 10, 2017

As per Zach, I'm opening this to track how the following issue can be resolved:

Zach - The transformation of the annotation from the classification straight from the API to the column in the CSV is exactly what is causing the issue on our side. The huge amount of data stored in the NfN dropdown tasks has run up against our limit to decode it quickly....
If that transformation is necessary, there's no better way to perform these exports than what is available via the API. Please do open a github issue with all the details you have, it will be easier to follow and we can get some more eyes on that way.

At Biospex, we are trying to download the Workflow Classification exports on a nightly basis via the API: sending a create request, then retrieving the file when it's completed. On several occasions, these files have not been created, due to the issue described above by Zach.
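
A minimal sketch of that nightly create-then-retrieve flow (the endpoint path and the media response shape here are assumptions based on how we call the Panoptes API, not an authoritative client):

```python
API = "https://panoptes.zooniverse.org/api"

def classifications_export_url(workflow_id):
    # Assumed media endpoint for requesting/fetching a workflow's classifications export.
    return f"{API}/workflows/{workflow_id}/classifications_export"

def export_csv_src(media_json):
    """Pull the CSV download URL out of an export media response,
    or return None while the export is still being generated (assumed shape)."""
    for item in media_json.get("media", []):
        if item.get("src"):
            return item["src"]
    return None

# Nightly flow (HTTP calls omitted; use any client):
#   1. POST classifications_export_url(wf_id) to request a fresh export.
#   2. GET the same URL periodically until export_csv_src(response_json) returns a URL.
#   3. Download the CSV from that URL.
```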

We then take these files and run them through the Label Reconciliation scripts created by Julie:
https://github.com/juliema/label_reconciliations
This gives us the County/State we need for displaying our heat maps, as well as the counts for our statistics on each Expedition/Workflow.

Zach suggests getting the Task information via the API. However, it arrives in a form that Julie's Label Reconciliation script cannot process correctly, since those scripts were written to handle the data as it appears in the CSV file. The Classification API also does not include the custom Subject information we need for processing (Biospex expedition id, image id, etc.).

Below is the column information contained in the CSV files and where each column originates, as far as I can tell. I could not find all the matching data.

classification_id - API -> Classifications
user_name - API -> User call
user_id - API -> Classifications
user_ip - ???
workflow_id - Several API endpoints, but not needed since we have it
workflow_name - API -> Workflow
workflow_version - API -> Workflow
created_at - API -> Classifications
gold_standard - ??? Apiary states this comes with Classification but I don't see it in the response
expert - ??? Apiary states this comes with Classification but I don't see it in the response
metadata - API -> Classifications
annotations - ??? See below**
subject_data - ??? See below**
subject_ids - API -> Classifications

** There are discrepancies between the JSON represented in the CSV file and what is returned by the API calls. While some information matches, it seems there is some processing going on that alters the representation before the data is put into these two columns. Should they not be similar?

A solution would be to enable collecting this data via various API calls, with the JSON for annotations and subject_data represented the same way as in the CSV, so that consumers can store and build their own CSV files locally instead of relying on the NFN servers. Julie's Label Reconciliation scripts could then be run against those local CSV files. It would perhaps also reduce the load from full CSV downloads every night.
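
As a rough illustration of what "build the CSV locally" would mean, here is a sketch that joins one classification with separately fetched user and workflow records into a row with the export's columns. Column names are taken from the list above; the record shapes (e.g. `links`, `display_name`, `version`) are assumptions about the API responses, not a definitive mapping:

```python
import json

CSV_COLUMNS = [
    "classification_id", "user_name", "user_id", "user_ip", "workflow_id",
    "workflow_name", "workflow_version", "created_at", "gold_standard",
    "expert", "metadata", "annotations", "subject_data", "subject_ids",
]

def build_row(classification, user, workflow):
    """Join API records into one export-style row. JSON-valued columns are
    serialized the way the export stores them: as JSON text in a single cell."""
    return {
        "classification_id": classification["id"],
        "user_name": user.get("login", ""),
        "user_id": classification["links"]["user"],
        "user_ip": "",  # not exposed by the API, as far as we can tell
        "workflow_id": workflow["id"],
        "workflow_name": workflow["display_name"],
        "workflow_version": workflow.get("version", ""),
        "created_at": classification["created_at"],
        "gold_standard": classification.get("gold_standard", ""),
        "expert": classification.get("expert", ""),
        "metadata": json.dumps(classification["metadata"]),
        "annotations": json.dumps(classification["annotations"]),
        "subject_data": "",  # requires a separate subjects lookup
        "subject_ids": " ".join(classification["links"]["subjects"]),
    }
```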

eatyourgreens (Contributor) commented:

Related discussion about just yanking those dropdown tasks from the API entirely: zooniverse/Panoptes-Front-End#3636

zwolf (Member) commented Apr 11, 2017

The data that provides the keys for the annotation translation is found in the workflow's tasks field. With that, you could perform the same kind of operation that we're doing with each classification's annotation (you can see how we're doing it here: https://github.com/zooniverse/Panoptes/blob/master/lib/formatter/csv/annotation_for_csv.rb).

It's that transformation that's the bottleneck. There's probably some potential optimization in our formatters, but there shouldn't be a need to store that kind or amount of information in the workflow in the first place. @camallen's idea of storing large dropdown task data in S3 would be a pretty good start. This would require work on both the front and back ends, so we'll need to figure out how to prioritize it.

cc: @camallen @marten @trouille @mcbouslog

zwolf (Member) commented Apr 12, 2017

@rbruhn The code I linked to above is how we're performing the translation on our side. You're familiar with pulling classifications directly from the API, and the data contained in the workflow's tasks property is all you need to write a script that performs the translation on your end. For example,

GET https://panoptes.zooniverse.org/api/workflows/2679 will return JSON that looks like this:

{
  "workflows": [
    {
      "id": "2679",
      "display_name": "Herbarium_Unlocking Northeastern Forests: Nature's Laboratories for Global Change",
      "tasks": {
        "T1": {
          "help": "[help text in markdown]",
          "next": "T17",
          "type": "dropdown",
          "selects": [
            {
              "id": "156be8e94f1de",
              "title": "Country",
              "options": {
                "*": [
                  {
                    "label": "Not shown",
                    "value": "b50064532baf9"
                  },
                  {
                    "label": "United States of America",
                    "value": "840"
                  },

This continues that way for 33 more questions, including every country on Earth, every state and county in the US (a couple of times), Canada, and Mexico, several versions of the list of years between 1800 and the present, and a list of the names of 20,000 people by last name (A-H).

The values there are the keys that are represented in a classification's annotation. The actual translation as performed by Panoptes is the code I linked above, and is doable in whatever language you like. The most relevant bit is here: https://github.com/zooniverse/Panoptes/blob/master/lib/formatter/csv/annotation_for_csv.rb#L141 where it turns the annotation key into a label.
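
In Python, that value-to-label lookup for a dropdown task could be sketched like this. The task shape follows the workflow JSON shown above; the assumption that a dropdown annotation carries one raw value per select is ours, and the Ruby formatter linked above remains the authoritative version:

```python
def dropdown_value_maps(task):
    """Build one {value: label} map per select in a dropdown task,
    from the workflow's tasks field."""
    maps = []
    for select in task.get("selects", []):
        lookup = {}
        # Options may be grouped under keys like "*" or a parent value;
        # flatten every group into one lookup per select.
        for options in select.get("options", {}).values():
            for opt in options:
                lookup[opt["value"]] = opt["label"]
        maps.append(lookup)
    return maps

def translate_dropdown(task, values):
    """Translate raw annotation values (one per select) into labels.
    Unknown keys are passed through unchanged."""
    maps = dropdown_value_maps(task)
    return [m.get(v, v) for m, v in zip(maps, values)]
```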

I was told that you may be able to enlist the aid of @juliema and @rafelafrance, who I'm pinging here to let them know what's up. Myself, @camallen and @marten can help answer questions.

Workflows like this example (2679) should be cleaned up and unused tasks deleted. While that wouldn't help Panoptes to process the classifications that have already been made (because of how our workflow versioning system ensures correctly translated annotations), it would certainly help in the future.

rafelafrance commented Apr 13, 2017

If I understand things correctly, you need to walk the JSON tree to get the data out of it. I'm only extracting data from the classifications for now, but the tree should be very similar to the one in the workflow. On the plus side, the CSV file contains all of the JSON in a single cell, so what I'm doing is pretty much what you need to do.

This code can easily be adapted to work with JSON data directly.

The outer call is here
https://github.com/juliema/label_reconciliations/blob/master/lib/unreconciled_builder.py#L103

and this is where the inner logic resides
https://github.com/juliema/label_reconciliations/blob/master/lib/unreconciled_builder.py#L81
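
In the same spirit, here is a small sketch of that tree walk over the JSON stored in one annotations cell (the column name matches the export columns listed earlier in the thread; the exact nesting of combo/nested tasks is an assumption):

```python
import json

def annotation_pairs(annotations_cell):
    """Flatten the JSON stored in one 'annotations' CSV cell into
    (task, value) pairs, recursing into nested value structures."""
    pairs = []

    def walk(node):
        if isinstance(node, dict):
            if "task" in node:
                pairs.append((node["task"], node.get("value")))
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(annotations_cell))
    return pairs
```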

marten (Contributor) commented Apr 13, 2017

I wanted to add that you could still download the workflows export and the subjects export, and use those to look up workflow/subject metadata/translations rather than querying our API for it. Those two probably change much less frequently than the classifications export.
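
For example, the workflow export could be loaded once into a lookup table keyed by (workflow_id, version), so each classification finds its matching task definitions locally. The column names here ("workflow_id", "version", "tasks" as a JSON cell) are assumptions about the export's layout:

```python
import csv
import io
import json

def workflow_task_lookup(workflow_export_csv):
    """Map (workflow_id, version) -> parsed tasks from a workflows export.
    Assumes columns named 'workflow_id', 'version', and 'tasks' (JSON text)."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(workflow_export_csv)):
        key = (row["workflow_id"], row["version"])
        lookup[key] = json.loads(row["tasks"])
    return lookup
```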

rbruhn (Author) commented Apr 13, 2017

I spoke to Rafe and Julie regarding this and their reconciliation script. Perhaps I wasn't clear about what I was trying to achieve, given the issues with NFN server load. To avoid causing further issues with CSV creation and export every night, I wanted to recreate the CSV file locally on my own and then run their reconciliation script. So, having spoken with them, the solution for me would be the following:

To create a workflow classification report locally:

  1. When a Workflow is created, access it via the API and store the relevant information for recreating the columns necessary in the CSV file locally.
  2. Each night, make a call to get the latest classifications by last id and store them locally, reducing server load and giving me the Annotations.
  3. If we do not have the subject data locally already, call the API using the subject id and store it.
  4. Call API for any user data that is provided to recreate CSV columns.
  5. Build the CSV file locally using the above information and run the reconciliation script on it.

I'm assuming the classification ids are in order, so I could simply append the data to existing CSV files instead of recreating them from scratch.
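
Under that assumption (monotonically increasing ids), the nightly step 2 reduces to keeping the highest id seen and appending only newer classifications. A sketch, where the page is the "classifications" array from one API response:

```python
def append_new(stored_last_id, page):
    """Return (classifications newer than stored_last_id, updated last_id).
    Assumes each classification carries a numeric-string 'id', as in the API."""
    fresh = [c for c in page if int(c["id"]) > stored_last_id]
    new_last = max([stored_last_id] + [int(c["id"]) for c in fresh])
    return fresh, new_last
```

Rows in `fresh` would then be formatted and appended to the local CSV, and `new_last` persisted for the next night's run.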

rbruhn (Author) commented Apr 14, 2017

OK, I was wrong in that last post. Or rather, my first post is right: each annotation needs to be run through the translation Zach suggested, using the Workflow keys.

zwolf (Member) commented Apr 14, 2017

Yeah, and like @marten said, the workflow export button in the lab will get you all the tasks for every workflow all at once. You'll want to re-export if any of them change, but since the ones you care about are live that should happen rarely (if ever).

stale bot commented Nov 9, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
