
Adding data to the archive


The archive has a set of ingestion commands from the Zeega platform that can be used to bulk import large data sets by the archive administrators.

Creating a new Archive user

In most cases, every archive has its own user (e.g. Yahoo! Japan). When adding new media to the archive, if it belongs to a new source it's necessary to create a new user and get its id. To create a new user, go to the sign up page, create a regular user and use the profile page to update the thumbnail image and the user bio with the archive description.

The user id (required for the next section) can be retrieved from the url of the user page. In the example above, Yahoo! Japan's user id is 3, and it appears at the end of the user page URL.

Viewing Data through Admin Dashboard

Rather than going through MySQL, you can view a list of our items (with convenient filters hidden in a dropdown) through the Admin Dashboard. The dashboard is only accessible to JDA accounts with admin privileges; Ryo or anyone else with admin privileges can grant them through the Admin Dashboard user list. This is a great way to check whether your data uploaded correctly.

General data import workflow

  1. Create a JSON file following the instructions below on this page

  2. Validate your file with JSONLint

  3. Log in to the development server and try to import the data to the database

    ssh [email protected]
    cd /var/www/zeega
    sudo app/console zeega:persist --file_path=PATH_TO_THE_JSON_FILE --ingestor=console --user=USER_ID --replace_duplicates
    

    The parameter --check_for_duplicates is optional. When it is used, only items that don't already exist in the database will be imported. The duplicate check is made using the item URI.

    The parameter --user refers to the user to whom the data belongs, e.g. Asahi is user 468.

    The parameter --replace_duplicates is optional. When it is used, items that already exist in the database will have their tags, location data, type, date created, etc. updated. This duplicate check is made using the item's ATTRIBUTION URI, so that an item whose URI is messed up can be replaced. A rough illustration of the two duplicate checks appears after this list.

  4. If everything went well and you didn't get any errors from the ingestion command, take a quick look at the database and check if the data seems to be properly encoded.

  5. If the data in the database looks correct, trigger a Solr re-index by going to the Solr Admin page, selecting "Verbose", "Clean", and "Commit", and clicking "Execute Import"

  6. When the Solr indexing is finished go to your dev instance (e.g. http://dev.jdarchive.org/your-user/web) and check if data is correct.

  7. If everything looks good you can import the data to production; see the "Importing to production" section below
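
The two duplicate flags key on different fields, as noted in step 3. The snippet below is a rough, hypothetical Python illustration of that behavior (it is not the actual Zeega ingestion code, and the sample URIs are made up): --check_for_duplicates skips items whose uri is already stored, while --replace_duplicates matches on attribution_uri and refreshes the stored metadata.

    # Hypothetical sketch of the duplicate handling described above; the real
    # logic lives in the zeega:persist console command, not in this script.
    existing_by_uri = {"http://example.org/photo/1"}              # URIs already in the DB (made up)
    existing_by_attribution = {"http://example.org/attr/1": {}}   # attribution_uri -> stored item (made up)

    def check_for_duplicates(items):
        """--check_for_duplicates: import only items whose uri is not in the DB yet."""
        return [item for item in items if item["uri"] not in existing_by_uri]

    def replace_duplicates(items):
        """--replace_duplicates: items matched by attribution_uri update the stored record."""
        new_items = []
        for item in items:
            stored = existing_by_attribution.get(item["attribution_uri"])
            if stored is not None:
                stored.update(item)  # tags, location data, type, date created, etc. get refreshed
            else:
                new_items.append(item)
        return new_items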

Importing to production

Make sure there is data backup before making any changes to the production database. There are two ways to go about the backups: create a new database snapshot or rely on an existing snapshot. We take daily snapshots of the database on AWS. To check when the last snapshot was taken, go to the snapshots page and filter by "Automated Snapshots". To create a new manual snapshot press "Create a new snapshot" and wait for it to complete.

Note: REALLY IMPORTANT - Doing this involves re-indexing SOLR. SOLR is flaky and takes about 45 minutes to complete a re-index, so NEVER do this directly prior to a presentation; most likely you will only want to do this at night. SOLR may hang while you are importing. In that case, you need to restart it by running sudo service tomcat7 restart on the production server. You may also need to do this after a successful import if the web interface is displaying a low count for the total number of items in the archive (i.e. under a million).

MAKE SURE YOU READ THE THING ABOVE BEFORE GOING FORWARD

I MEAN IT

  1. ssh into the production server (174.129.31.158 or prod.jdarchive.org) and log in as ubuntu. Your own SSH key must be uploaded in order to log in as ubuntu: email your dev lead with your public key (either the id_rsa.pub file or the long string contained in the file, beginning with ssh-rsa and often ending with an email - either is fine)
  2. Copy your JSON file onto the production server. DO NOT do this through Git. Use something like sftp or scp.
  3. Copy your JSON file into /var/www/zeega
  4. Change to that directory (i.e. cd /var/www/zeega)
  5. Run the following command (or something similar based on your needs). You will need to replace the variables in all caps below.
app/console zeega:persist --file_path=YOUR_JSONFILE.json --ingestor=console --user=USER_ID --check_for_duplicates
  6. After this is done running, you should go into the production database and make sure everything looks correct. (Do this from the CONSOLE, not the web.)
  7. At this point, SOLR needs to be reloaded. The recommended way to do this is through the web interface. This is not accessible publicly, so you need to set up SSH port tunneling. For port tunneling, the "source port" is 8080 and the "destination port" is prod.jdarchive.org:8080. Port tunneling is supported by the default ssh program on Mac/Linux - on Mac the command is ssh -L 8080:prod.jdarchive.org:8080 [email protected]. On Windows this can be done through PuTTY with the source and destination ports mentioned above (make sure your ssh key is loaded in as well). As stated above, you will need to map a local port to port 8080 on production.
  8. Once you have your port tunneling set up, you will likely need to navigate to http://localhost:8080/solr/#/zeega/dataimport//dataimport - copy this directly into the browser. I'm not hyperlinking it because this wiki screws it up when you click it.
  9. Once you are on this page, check the "verbose", "clean", and "commit" check boxes and then click the "execute" button. This takes about 45 minutes to run. You will see a status update along the way on this page, so keep the tunnel open (a small status-polling sketch follows this list). If the status ever stops increasing for more than a few minutes, SOLR has frozen and you need to restart it. If you don't know how to do this, go back and read the warning above that you were told repeatedly not to skip.
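
If you prefer to watch the re-index from a terminal instead of refreshing the page, a minimal sketch along these lines can poll the import status through the same SSH tunnel. It assumes the standard Solr DataImportHandler status endpoint and the zeega core name visible in the admin URL above; adjust the path if your setup differs.

    import json
    import time
    import urllib.request

    # Assumes the SSH tunnel from step 7 is forwarding localhost:8080 to production
    # and that the Solr core is named "zeega", as in the admin URL above.
    STATUS_URL = "http://localhost:8080/solr/zeega/dataimport?command=status&wt=json"

    while True:
        with urllib.request.urlopen(STATUS_URL) as response:
            status = json.load(response)
        # statusMessages typically includes running totals such as "Total Rows Fetched".
        print(status.get("status"), status.get("statusMessages", {}))
        if status.get("status") != "busy":
            break
        time.sleep(60)  # the full re-index takes about 45 minutes, so poll slowly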

Data format

In order to import data to the archive it has to be defined in the following JSON format:

{
    "items": [
        {
        	// item1 fields
        },
        {
        	// item2 fields
        }
    ]
}

The data schema for each item is defined in the next section of this document. All the data has to be JSON encoded. An easy way to validate your data and make sure that it is correctly encoded is to use the JSONLint website.
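
As an example, a small hypothetical script along these lines wraps your records in the required envelope and confirms the file parses as JSON before you paste it into JSONLint (the field values are placeholders, not real archive data):

    import json

    # Placeholder records; in practice these come from your scraper or spreadsheet export.
    records = [
        {
            "title": "Example item",
            "uri": "http://example.org/media/1.jpg",
            "attribution_uri": "http://example.org/pages/1",
            "archive": "Example Archive",
            "media_type": "Image",
            "layer_type": "Image",
            "media_creator_username": "example_user",
            "child_items_count": 0,
            "published": True,
            "tags": ["example"],
        },
    ]

    # Everything must live under a top-level "items" key, as shown above.
    with open("items.json", "w", encoding="utf-8") as f:
        json.dump({"items": records}, f, ensure_ascii=False, indent=4)

    # Round-trip to confirm the file is valid JSON (the same check JSONLint performs).
    with open("items.json", encoding="utf-8") as f:
        json.load(f)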

Replacing Deleted Items

If you need to replace accidentally-deleted items:

  1. IMMEDIATELY go on AWS. We have snapshots that go back 3 days from the current database at all times. Restore one of these (from before you screwed up) and use something like MySQL workbench to export the old items that you accidentally deleted.
  2. Change file /var/www/zeega/src/Zeega/CoreBundle/Resources/config/doctrine/Item.orm.yml on the production server, so that the "id" section reads generator: { strategy: NONE } instead of generator: { strategy: AUTO }
  3. Change file /var/www/zeega/src/Zeega/DataBundle/Service/ItemService.php, by uncommenting the part that says if(isset($itemArray['id'])) {$item->setId($itemArray['id']);} around line 135. With steps 2 and 3, we are skirting around the automatic numbering of the database that would otherwise tack these items onto the end of the database with the next available ID, and putting them into our own spots in the database instead (the IDs of where they were before, which will be preserved in the exported data from step 1).
  4. Edit the exported data from step 1 to get it into the proper format. You will most likely have to change the arrays, and you should delete "attributes" and anything with a NULL value. Jsonbeautify.py in the JDA-scripts repo will help with finding problems. A rough cleanup sketch appears after the note below.
  5. TEST WITH LIKE 10 ITEMS FIRST, proceeding with the data import like usual, and make sure that the items are inserted properly into their old spots in the database. Also, make sure that on the Admin Dashboard, you can click the item and get to the editing page. If an error occurs on the editing page, some of the data was in the wrong format in step 4.
  6. Do the rest of the items if step 5 was successful.
  7. CHANGE BACK THE FILES IN STEPS 2 and 3! Put back the comments around the ID part, and change the strategy back to AUTO.

You should note that all of this takes place on the production server, in the /var/www/zeega directory, not the /var/www/jda directory.
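
A rough sketch of the cleanup described in step 4, assuming the deleted rows were exported from the restored snapshot to a JSON file (the file names here are placeholders; Jsonbeautify.py in the JDA-scripts repo remains the reference tool for spotting problems):

    import json

    # Hypothetical input: rows exported with MySQL Workbench, one dict per item.
    with open("exported_rows.json", encoding="utf-8") as f:
        rows = json.load(f)

    cleaned = []
    for row in rows:
        # Drop NULL values and the "attributes" column, as described in step 4.
        item = {k: v for k, v in row.items() if v is not None and k != "attributes"}
        # The original id is kept so the item goes back into its old spot in the
        # database (this only works with the strategy: NONE change from step 2).
        cleaned.append(item)

    with open("restored_items.json", "w", encoding="utf-8") as f:
        json.dump({"items": cleaned}, f, ensure_ascii=False, indent=4)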

Item schema

When making a JSON file with data to be imported to the archive, each item has to be defined according to the schema below.

| Name | Required? | Type | Description |
| --- | --- | --- | --- |
| title | No | string, max 255 characters | The title of the item |
| description | No | string, max 500 characters | The description of the item |
| uri | Yes | string, max 500 characters | URI of the item. Used for display and playback. E.g. Flickr photo URI |
| attribution_uri | Yes | string, max 500 characters | URI of the item. Used as a reference to the original media source. E.g. Flickr photo attribution URI |
| archive | Yes | string, max 50 characters | Archive to which the item belongs. E.g. Flickr, Reischauer Institute, Youtube, Collection |
| media_type | Yes | string, max 20 characters | Media type of the item. See the "Media and Layer types" section of the data schema documentation |
| layer_type | Yes | string, max 20 characters | Layer type of the item. See the "Media and Layer types" section of the data schema documentation |
| thumbnail_url | No | string, max 500 characters | URI of the item's thumbnail |
| child_items_count | Yes | integer | Number of child items. If the item is a collection, this field holds the number of items in it. |
| media_geo_latitude | No | float | Item's latitude. Currently using the EPSG:4326 projection. |
| media_geo_longitude | No | float | Item's longitude. Currently using the EPSG:4326 projection. |
| location | No | string, max 100 characters | Item's location name (e.g. city name, street name) |
| media_date_created | No | datetime | Date when the item was created at the source. E.g. date extracted from the Flickr API response when adding a photo. |
| media_date_created_end | No | datetime | Complementary to the media_date_created field, for items that don't have a creation date but have a time interval. |
| media_creator_username | Yes | string, max 80 characters | Username, at the media source, of the user that created the item. E.g. Twitter handle |
| media_creator_realname | No | string, max 80 characters | Real name, at the media source, of the user that created the item. If empty, this field is automatically populated with the name of its owner. E.g. Twitter display name |
| license | No | string, max 50 characters | The item's license at the media source |
| attributes | No | JSON array | Array of weakly typed attributes |
| tags | No | JSON array | Array of tags |
| id_at_source | No | string | Id of the item at the source. Useful for incremental imports when importing large datasets from external databases. |
| published | Yes | boolean | Whether the item is published. Default is false. |
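
Before importing, it can help to sanity-check each item against the table above. A minimal, unofficial checker might look like the sketch below; the required fields and length limits come from the table, and the file name is a placeholder.

    import json

    REQUIRED = ["uri", "attribution_uri", "archive", "media_type", "layer_type",
                "media_creator_username", "child_items_count", "published"]
    MAX_LENGTHS = {"title": 255, "description": 500, "uri": 500, "attribution_uri": 500,
                   "archive": 50, "media_type": 20, "layer_type": 20, "thumbnail_url": 500,
                   "location": 100, "media_creator_username": 80,
                   "media_creator_realname": 80, "license": 50}

    with open("items.json", encoding="utf-8") as f:
        items = json.load(f)["items"]

    for i, item in enumerate(items):
        for field in REQUIRED:
            if field not in item:
                print(f"item {i}: missing required field {field}")
        for field, limit in MAX_LENGTHS.items():
            value = item.get(field)
            if isinstance(value, str) and len(value) > limit:
                print(f"item {i}: {field} exceeds {limit} characters")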

Example JSON file

{
    "items": [
        {
            "username": "",
            "display_name": "",
            "title": "Chinese takeout and a werewolf movie on #halloween #planettakeout",
            "description": "",
            "text": "",
            "uri": "http://distilleryimage10.s3.amazonaws.com/55d42ec823be11e28a5c22000a1f8acf_7.jpg",
            "attribution_uri": "http://instagr.am/p/Rd6wALylKO/",
            "date_created": "2012-11-06 22:14:39",
            "media_type": "Image",
            "layer_type": "Image",
            "archive": "Instagram",
            "thumbnail_url": "http://distilleryimage10.s3.amazonaws.com/55d42ec823be11e28a5c22000a1f8acf_5.jpg",
            "media_geo_latitude": null,
            "media_geo_longitude": null,
            "media_date_created": "2012-11-01 00:51:56",
            "media_date_created_end": null,
            "media_creator_username": "mmsea",
            "media_creator_realname": "Megan Mary",
            "child_items_count": 0,
            "attributes": [],
            "child_items": [],
            "tags": [
                "halloween",
                "planettakeout"
            ]
        }
    ]
}