Extensions
A lot of functionality is best handled by an extension system rather than by adding more and more elements to the code base. These extensions will be particular to Presidio (i.e., the MySQL-formatted Bookworm), but could potentially be ported to other platforms as well.
Here are some extensions that would take less than a weekend's work under the current system:
- Genderation ==============
Given a field that contains names (first or complete), add fields to the Bookworm that include gender both as a flat determination and as a probability. Ideally, it should take the person's birth year into account (19th-century Leslies are male, etc.). This could be implemented using my old code, which is a bit more flexible, or Lincoln Mullen's R package, which I've used on one Bookworm.
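A minimal sketch of the idea, not the old code or the R package mentioned above. The counts table, the function name `gender_fields`, the output field names, and the 0.8 threshold are all assumptions made up for illustration; a real extension would draw on an actual source of name counts by year.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy counts, invented for illustration only (not real data):
# (first name, decade of birth) -> (female births, male births)
TOY_COUNTS: Dict[Tuple[str, int], Tuple[int, int]] = {
    ("leslie", 1880): (10, 90),
    ("leslie", 1980): (85, 15),
}


def gender_fields(name: str, birth_year: int, threshold: float = 0.8) -> Dict[str, Optional[object]]:
    """Return a flat gender label plus the probability behind it."""
    parts = name.strip().split()
    first = parts[0].lower() if parts else ""
    decade = (birth_year // 10) * 10
    counts = TOY_COUNTS.get((first, decade))
    if counts is None or sum(counts) == 0:
        # No evidence for this name and decade: do not guess.
        return {"gender": "unknown", "prob_female": None}
    female, male = counts
    prob_female = female / (female + male)
    if prob_female >= threshold:
        label = "female"
    elif prob_female <= 1 - threshold:
        label = "male"
    else:
        label = "unknown"
    return {"gender": label, "prob_female": prob_female}
```

Keeping the probability alongside the flat label lets a Bookworm user pick a stricter or looser cutoff later without rerunning the extension.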
- Geolocation ===============
Given the name of a field that contains place names, add in a longitude-latitude formatted set of places. The real trick here is nailing the standard Bookworm format exactly. We've used separate Lat and Lon fields before, but I now think that's inefficient.
I discussed this a bit with Mitch Fraas for State of the Unions: tag each document by the places it mentions in XML, and then run those tags through the ordinary geolocation module.
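As a rough sketch of what such a module might do, the function below looks up each place name for a document and stores the result in a single combined field instead of separate latitude and longitude fields. The gazetteer, the function name `add_points`, the output field name `point`, and the `"lat,lon"` string layout are assumptions for illustration; they are not the standard Bookworm format, which would still need to be settled.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer with approximate coordinates; a real
# extension would query a proper geocoding source instead.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "boston": (42.36, -71.06),
    "philadelphia": (39.95, -75.17),
}


def add_points(doc: dict, place_field: str, out_field: str = "point") -> dict:
    """Return a copy of doc with one combined lat,lon string per known place."""
    raw = doc.get(place_field, [])
    # Accept either a single place name or a list of tagged places.
    names = [raw] if isinstance(raw, str) else list(raw)
    points: List[str] = []
    for name in names:
        coords = GAZETTEER.get(name.strip().lower())
        if coords is not None:
            lat, lon = coords
            points.append(f"{lat:.4f},{lon:.4f}")
    enriched = dict(doc)
    enriched[out_field] = points
    return enriched
```

Because the field accepts a list, the same function would cover documents tagged with several places, as in the XML-tagging approach described above; names missing from the gazetteer are skipped rather than guessed.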
- Serial Killer ================
Implement the Ngrams serial killer algorithm in Bookworm. This has been done once, but it might be a nice pocket example of how this can work on the OL/Hathi data.
- University Linker ====================
This is on my mind because of one Bookworm in particular, but with an academic audience it might be nice to have some pathway for enriching texts that mention university names in unstructured text with all the data about universities in ICOADS.
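One plausible first step is fuzzy-matching the unstructured names against a reference list, so that each text can be joined to the external data by an identifier. The sketch below uses Python's standard `difflib`; the reference table, the made-up IDs, the function name `link_university`, and the 0.85 cutoff are all hypothetical.

```python
import difflib
from typing import Dict, Optional

# Hypothetical reference table: normalized name -> made-up institution ID.
UNIVERSITIES: Dict[str, str] = {
    "harvard university": "U0001",
    "yale university": "U0002",
}


def link_university(raw_name: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the ID of the closest reference name, or None if nothing is close."""
    # Lowercase and collapse runs of whitespace before comparing.
    normalized = " ".join(raw_name.lower().split())
    matches = difflib.get_close_matches(
        normalized, list(UNIVERSITIES), n=1, cutoff=cutoff
    )
    return UNIVERSITIES[matches[0]] if matches else None
```

A high cutoff errs on the side of leaving a text unlinked rather than attaching the wrong institution's data to it.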
- The SPARQL link ===================
This would be the greatest thing ever to happen to Bookworm, but it's hard to quickly explain. Find me (Ben) if you want to know more.
- Topic Models ===============
It's not too tricky to impute some classifiers based on topic modeling a corpus; this would add a new, potentially useful set of metadata to each document. (I've also fooled around with classifying each individual word through MALLET; this would be interesting, but the architecture changes are slightly greater.)
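To make the document-level version concrete, here is a small sketch that turns per-document topic weights (as any topic model would produce) into a categorical metadata field. It does not run a topic model itself; the function name `topic_metadata`, the output keys, and the `min_share` cutoff are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def topic_metadata(
    doc_ids: Sequence[str],
    doc_topics: Sequence[Sequence[float]],
    topic_labels: Sequence[str],
    min_share: float = 0.3,
) -> List[Dict[str, object]]:
    """Assign each document its dominant topic as a metadata field."""
    rows: List[Dict[str, object]] = []
    for doc_id, weights in zip(doc_ids, doc_topics):
        total = sum(weights)
        if not weights or total <= 0:
            # No usable topic weights for this document.
            rows.append({"doc_id": doc_id, "primary_topic": "unknown"})
            continue
        shares = [w / total for w in weights]
        best = max(range(len(shares)), key=shares.__getitem__)
        # Documents with no clearly dominant topic are labeled "mixed".
        label = topic_labels[best] if shares[best] >= min_share else "mixed"
        rows.append({"doc_id": doc_id, "primary_topic": label})
    return rows
```

For example, `topic_metadata(["a", "b"], [[7, 2, 1], [4, 3, 3]], ["war", "trade", "law"], min_share=0.5)` labels document `a` as `war` and document `b` as `mixed`.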