
Adding data to the archive


The archive has a set of ingestion commands from the Zeega platform that can be used to bulk import large data sets by the archive administrators.

Creating a new Archive user

In most cases, every archive has its own user (e.g. Yahoo! Japan). When adding new media to the archive, if it belongs to a new source it's necessary to create a new user and get its id. To create a new user, go to the sign up page, create a regular user and use the profile page to update the thumbnail image and the user bio with the archive description.

The user id (required for the next section) can be retrieved from the url of the user page. In the example above, Yahoo! Japan's user id is 3, and it appears at the end of the user page URL.

Viewing Data through Admin Dashboard

Rather than going through MySQL, you can view a list of our items (with convenient filters hidden in a dropdown) through the Admin Dashboard. The dashboard is only accessible to JDA accounts with admin privileges; Ryo or anyone else with admin privileges can grant them through the Admin Dashboard user list. This is a great way to check whether your data uploaded correctly.

General data import workflow

  1. Create a JSON file following the instructions below on this page

  2. Validate your file with JSONLint

  3. Log in to the development server and try to import the data to the database

    ssh [email protected]
    cd /var/www/zeega
    sudo app/console zeega:persist --file_path=PATH_TO_THE_JSON_FILE --ingestor=console --user=USER_ID --replace_duplicates
    

    The parameter --check_for_duplicates is optional. When it is used, only items that don't already exist in the database will be imported. The duplicate check is made using the item URI.

    The parameter --user refers to the user to whom the data belongs, e.g. Asahi is user 468.

    The parameter --replace_duplicates is optional. When it is used, items that already exist in the database will have their tags, location data, type, date created, etc. updated. This duplicate check is made using the item's ATTRIBUTION URI, so that an item whose URI is messed up can be replaced. A rough illustration of the two duplicate checks appears after this list.

  4. If everything went well and you didn't get any errors from the ingestion command, take a quick look at the database and check if the data seems to be properly encoded.

  5. If the data in the database looks correct, trigger a Solr re-index by going to the Solr Admin page, selecting "Verbose", "Clean", and "Commit", and clicking "Execute Import"

  6. When the Solr indexing is finished go to your dev instance (e.g. http://dev.jdarchive.org/your-user/web) and check if data is correct.

  7. If everything looks good you can import the data to production; see the "Importing to production" section below
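
The two duplicate flags key on different fields, as noted in step 3. The snippet below is a rough, hypothetical Python illustration of that behavior (it is not the actual Zeega ingestion code, and the sample URIs are made up): --check_for_duplicates skips items whose uri is already stored, while --replace_duplicates matches on attribution_uri and refreshes the stored metadata.

    # Hypothetical sketch of the duplicate handling described above; the real
    # logic lives in the zeega:persist console command, not in this script.
    existing_by_uri = {"http://example.org/photo/1"}              # URIs already in the DB (made up)
    existing_by_attribution = {"http://example.org/attr/1": {}}   # attribution_uri -> stored item (made up)

    def check_for_duplicates(items):
        """--check_for_duplicates: import only items whose uri is not in the DB yet."""
        return [item for item in items if item["uri"] not in existing_by_uri]

    def replace_duplicates(items):
        """--replace_duplicates: items matched by attribution_uri update the stored record."""
        new_items = []
        for item in items:
            stored = existing_by_attribution.get(item["attribution_uri"])
            if stored is not None:
                stored.update(item)  # tags, location data, type, date created, etc. get refreshed
            else:
                new_items.append(item)
        return new_items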

Importing to production

Make sure there is data backup before making any changes to the production database. There are two ways to go about the backups: create a new database snapshot or rely on an existing snapshot. We take daily snapshots of the database on AWS. To check when the last snapshot was taken, go to the snapshots page and filter by "Automated Snapshots". To create a new manual snapshot press "Create a new snapshot" and wait for it to complete.

Note: REALLY IMPORTANT - Doing this involves re-indexing SOLR. SOLR is flaky and takes about 45 minutes to complete a re-index, so NEVER do this directly prior to a presentation; most likely you will only want to do this at night. SOLR may hang while you are importing. In that case, you need to restart it by running sudo service tomcat7 restart on the production server. You may also need to do this after a successful import if the web interface is displaying a low count for the total number of items in the archive (i.e. under a million).

MAKE SURE YOU READ THE THING ABOVE BEFORE GOING FORWARD

I MEAN IT

  1. ssh into the production server (174.129.31.158 or prod.jdarchive.org) and log in as ubuntu. Your own SSH key must be uploaded in order to log in as ubuntu: email your dev lead with your public key (either the id_rsa.pub file or the long string contained in the file, beginning with ssh-rsa and often ending with an email - either is fine)
  2. Copy your JSON file onto the production server. DO NOT do this through Git. Use something like sftp or scp.
  3. Copy your JSON file into /var/www/zeega
  4. Change to that directory (i.e. cd /var/www/zeega)
  5. Run the following command (or something similar based on your needs). You will need to replace the variables in all caps below.
app/console zeega:persist --file_path=YOUR_JSONFILE.json --ingestor=console --user=USER_ID --check_for_duplicates
  6. After this is done running, you should go into the production database and make sure everything looks correct. (Do this from the CONSOLE, not the web.)
  7. At this point, SOLR needs to be reloaded. The recommended way to do this is through the web interface. This is not accessible publicly, so you need to set up SSH port tunneling. For port tunneling, the "source port" is 8080 and the "destination port" is prod.jdarchive.org:8080. Port tunneling is supported by the default ssh program on Mac/Linux - on Mac the command is ssh -L 8080:prod.jdarchive.org:8080 [email protected]. On Windows this can be done through PuTTY with the source and destination ports mentioned above (make sure your ssh key is loaded in as well). As stated above, you will need to map a local port to port 8080 on production.
  8. Once you have your port tunneling set up, you will likely need to navigate to http://localhost:8080/solr/#/zeega/dataimport//dataimport - copy this directly into the browser. I'm not hyperlinking it because this wiki screws it up when you click it.
  9. Once you are on this page, check the "verbose", "clean", and "commit" check boxes and then click the "execute" button. This takes about 45 minutes to run. You will see a status update along the way on this page, so keep the tunnel open (a small status-polling sketch follows this list). If the status ever stops increasing for more than a few minutes, SOLR has frozen and you need to restart it. If you don't know how to do this, go back and read the warning above that you were told repeatedly not to skip.
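
If you prefer to watch the re-index from a terminal instead of refreshing the page, a minimal sketch along these lines can poll the import status through the same SSH tunnel. It assumes the standard Solr DataImportHandler status endpoint and the zeega core name visible in the admin URL above; adjust the path if your setup differs.

    import json
    import time
    import urllib.request

    # Assumes the SSH tunnel from step 7 is forwarding localhost:8080 to production
    # and that the Solr core is named "zeega", as in the admin URL above.
    STATUS_URL = "http://localhost:8080/solr/zeega/dataimport?command=status&wt=json"

    while True:
        with urllib.request.urlopen(STATUS_URL) as response:
            status = json.load(response)
        # statusMessages typically includes running totals such as "Total Rows Fetched".
        print(status.get("status"), status.get("statusMessages", {}))
        if status.get("status") != "busy":
            break
        time.sleep(60)  # the full re-index takes about 45 minutes, so poll slowly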

Data format

In order to import data to the archive it has to be defined in the following JSON format:

{
    "items": [
        {
        	// item1 fields
        },
        {
        	// item2 fields
        }
    ]
}

The data schema for each item is defined in the next section of this document. All the data has to be JSON encoded. An easy way to validate your data and make sure that it is correctly encoded is to use the JSONLint website.
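
As an example, a small hypothetical script along these lines wraps your records in the required envelope and confirms the file parses as JSON before you paste it into JSONLint (the field values are placeholders, not real archive data):

    import json

    # Placeholder records; in practice these come from your scraper or spreadsheet export.
    records = [
        {
            "title": "Example item",
            "uri": "http://example.org/media/1.jpg",
            "attribution_uri": "http://example.org/pages/1",
            "archive": "Example Archive",
            "media_type": "Image",
            "layer_type": "Image",
            "media_creator_username": "example_user",
            "child_items_count": 0,
            "published": True,
            "tags": ["example"],
        },
    ]

    # Everything must live under a top-level "items" key, as shown above.
    with open("items.json", "w", encoding="utf-8") as f:
        json.dump({"items": records}, f, ensure_ascii=False, indent=4)

    # Round-trip to confirm the file is valid JSON (the same check JSONLint performs).
    with open("items.json", encoding="utf-8") as f:
        json.load(f)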

Replacing Deleted Items

If you need to replace accidentally-deleted items:

  1. IMMEDIATELY go on AWS. We have snapshots that go back 3 days from the current database at all times. Restore one of these (from before you screwed up) and use something like MySQL workbench to export the old items that you accidentally deleted.
  2. Change file /var/www/zeega/src/Zeega/CoreBundle/Resources/config/doctrine/Item.orm.yml on the production server, so that the "id" section reads generator: { strategy: NONE } instead of generator: { strategy: AUTO }
  3. Change file /var/www/zeega/src/Zeega/DataBundle/Service/ItemService.php, by uncommenting the part that says if(isset($itemArray['id'])) {$item->setId($itemArray['id']);} around line 135. With steps 2 and 3, we are skirting around the automatic numbering of the database that would otherwise tack these items onto the end of the database with the next available ID, and putting them into our own spots in the database instead (the IDs of where they were before, which will be preserved in the exported data from step 1).
  4. Edit the exported data from step 1 to get it into the proper format. You will most likely have to change the arrays, and you should delete "attributes" and anything with a NULL value. Jsonbeautify.py in the JDA-scripts repo will help with finding problems. A rough cleanup sketch appears after the note below.
  5. TEST WITH LIKE 10 ITEMS FIRST, proceeding with the data import like usual, and make sure that the items are inserted properly into their old spots in the database. Also, make sure that on the Admin Dashboard, you can click the item and get to the editing page. If an error occurs on the editing page, some of the data was in the wrong format in step 4.
  6. Do the rest of the items if step 5 was successful.
  7. CHANGE BACK THE FILES IN STEPS 2 and 3! Put back the comments around the ID part, and change the strategy back to AUTO.

You should note that all of this takes place on the production server, in the /var/www/zeega directory, not the /var/www/jda directory.
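
A rough sketch of the cleanup described in step 4, assuming the deleted rows were exported from the restored snapshot to a JSON file (the file names here are placeholders; Jsonbeautify.py in the JDA-scripts repo remains the reference tool for spotting problems):

    import json

    # Hypothetical input: rows exported with MySQL Workbench, one dict per item.
    with open("exported_rows.json", encoding="utf-8") as f:
        rows = json.load(f)

    cleaned = []
    for row in rows:
        # Drop NULL values and the "attributes" column, as described in step 4.
        item = {k: v for k, v in row.items() if v is not None and k != "attributes"}
        # The original id is kept so the item goes back into its old spot in the
        # database (this only works with the strategy: NONE change from step 2).
        cleaned.append(item)

    with open("restored_items.json", "w", encoding="utf-8") as f:
        json.dump({"items": cleaned}, f, ensure_ascii=False, indent=4)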

Item schema

When making a JSON file with data to be imported to the archive, each item has to be defined according to the schema below.

| Name | Required? | Type | Description |
| --- | --- | --- | --- |
| title | No | string, max 255 characters | The title of the item |
| description | No | string, max 500 characters | The description of the item |
| uri | Yes | string, max 500 characters | URI of the item. Used for display and playback. E.g. Flickr photo URI |
| attribution_uri | Yes | string, max 500 characters | URI of the item. Used as a reference to the original media source. E.g. Flickr photo attribution URI |
| archive | Yes | string, max 50 characters | Archive to which the item belongs. E.g. Flickr, Reischauer Institute, Youtube, Collection |
| media_type | Yes | string, max 20 characters | Media type of the item. See the "Media and Layer types" section of the data schema documentation |
| layer_type | Yes | string, max 20 characters | Layer type of the item. See the "Media and Layer types" section of the data schema documentation |
| thumbnail_url | No | string, max 500 characters | URI of the item's thumbnail |
| child_items_count | Yes | integer | Number of child items. If the item is a collection, this field holds the number of items in it. |
| media_geo_latitude | No | float | Item's latitude. Currently using the EPSG:4326 projection. |
| media_geo_longitude | No | float | Item's longitude. Currently using the EPSG:4326 projection. |
| location | No | string, max 100 characters | Item's location name (e.g. city name, street name) |
| media_date_created | No | datetime | Date when the item was created at the source. E.g. date extracted from the Flickr API response when adding a photo. |
| media_date_created_end | No | datetime | Complementary to the media_date_created field, for items that don't have a creation date but have a time interval. |
| media_creator_username | Yes | string, max 80 characters | Username, at the media source, of the user that created the item. E.g. Twitter handle |
| media_creator_realname | No | string, max 80 characters | Real name, at the media source, of the user that created the item. If empty, this field is automatically populated with the name of its owner. E.g. Twitter display name |
| license | No | string, max 50 characters | The item's license at the media source |
| attributes | No | JSON array | Array of weakly typed attributes |
| tags | No | JSON array | Array of tags |
| id_at_source | No | string | Id of the item at the source. Useful for incremental imports when importing large datasets from external databases. |
| published | Yes | boolean | Whether the item is published. Default is false. |
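
Before importing, it can help to sanity-check each item against the table above. A minimal, unofficial checker might look like the sketch below; the required fields and length limits come from the table, and the file name is a placeholder.

    import json

    REQUIRED = ["uri", "attribution_uri", "archive", "media_type", "layer_type",
                "media_creator_username", "child_items_count", "published"]
    MAX_LENGTHS = {"title": 255, "description": 500, "uri": 500, "attribution_uri": 500,
                   "archive": 50, "media_type": 20, "layer_type": 20, "thumbnail_url": 500,
                   "location": 100, "media_creator_username": 80,
                   "media_creator_realname": 80, "license": 50}

    with open("items.json", encoding="utf-8") as f:
        items = json.load(f)["items"]

    for i, item in enumerate(items):
        for field in REQUIRED:
            if field not in item:
                print(f"item {i}: missing required field {field}")
        for field, limit in MAX_LENGTHS.items():
            value = item.get(field)
            if isinstance(value, str) and len(value) > limit:
                print(f"item {i}: {field} exceeds {limit} characters")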

Example JSON file

{
    "items": [
        {
            "username": "",
            "display_name": "",
            "title": "Chinese takeout and a werewolf movie on #halloween #planettakeout",
            "description": "",
            "text": "",
            "uri": "http://distilleryimage10.s3.amazonaws.com/55d42ec823be11e28a5c22000a1f8acf_7.jpg",
            "attribution_uri": "http://instagr.am/p/Rd6wALylKO/",
            "date_created": "2012-11-06 22:14:39",
            "media_type": "Image",
            "layer_type": "Image",
            "archive": "Instagram",
            "thumbnail_url": "http://distilleryimage10.s3.amazonaws.com/55d42ec823be11e28a5c22000a1f8acf_5.jpg",
            "media_geo_latitude": null,
            "media_geo_longitude": null,
            "media_date_created": "2012-11-01 00:51:56",
            "media_date_created_end": null,
            "media_creator_username": "mmsea",
            "media_creator_realname": "Megan Mary",
            "child_items_count": 0,
            "attributes": [],
            "child_items": [],
            "tags": [
                "halloween",
                "planettakeout"
            ]
        }
    ]
}