The Gutenberg Graphalyzer project aims to provide a means of measuring the structural complexity of works of literature. It is currently hard-coded to support only texts from Project Gutenberg.
All results were based on a corpus created from Project Gutenberg's April 2010 DVD. A processed corpus collection can be found here.
Further documentation is located on my wiki here. The documentation on the project page includes some useful queries, known parsing issues, and other miscellaneous material.
-
graphalyzer.py -- Parses an individual Project Gutenberg text. Assumes the header and footer licensing text is present. Run the script with '-h' for further information.
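The header/footer assumption can be illustrated with a minimal sketch. This is not graphalyzer.py's actual logic; it only shows one common way to isolate the body text between the "*** START OF ..." and "*** END OF ..." marker lines that Project Gutenberg files use:

```python
import re

# Marker patterns used by Project Gutenberg texts; exact wording varies
# between editions, so the regexes are deliberately loose.
START_RE = re.compile(r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE)

def strip_gutenberg_license(text):
    """Return only the body between the header and footer markers."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if START_RE.search(line):
            start = i + 1  # body begins after the START marker line
        elif END_RE.search(line):
            end = i        # body ends before the END marker line
            break
    return "\n".join(lines[start:end]).strip()
```

If neither marker is found, the sketch falls back to returning the whole text, which is why graphalyzer.py's assumption that both are present matters.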
-
make-db-py3.py -- Creates the database from a directory of Project Gutenberg text files and an RDF catalog file. Has three global "constants" that must be set for proper usage.
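The shape of that setup can be sketched as follows. The constant names (TEXT_DIR, RDF_CATALOG, DB_PATH) and the table schema are assumptions for illustration, not the script's actual globals:

```python
import glob
import os
import sqlite3

# Hypothetical globals standing in for the three constants the script needs.
TEXT_DIR = "corpus/"         # directory of Project Gutenberg .txt files
RDF_CATALOG = "catalog.rdf"  # path to the RDF metadata catalog
DB_PATH = "gutenberg.db"     # output SQLite database

def build_db(text_dir=TEXT_DIR, db_path=DB_PATH):
    """Index every .txt file in text_dir into a small SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS books (id TEXT PRIMARY KEY, path TEXT)")
    for path in glob.glob(os.path.join(text_dir, "*.txt")):
        book_id = os.path.splitext(os.path.basename(path))[0]
        conn.execute("INSERT OR REPLACE INTO books VALUES (?, ?)", (book_id, path))
    conn.commit()
    return conn
```

The real script additionally joins in metadata parsed from the RDF catalog, which is omitted here.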
-
run-experiment.sh -- Finds all text files in the corpus and runs the graphalyzer script on each. Uses xargs to run the script in parallel; thanks to SQLite's locking, I have seen no race conditions.
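The same find-then-fan-out pattern can be sketched in Python. This is only an illustration of the shape of the shell script, with the per-file worker passed in as a parameter; the real script invokes graphalyzer.py via xargs:

```python
import pathlib
from concurrent.futures import ThreadPoolExecutor

def run_all(corpus_dir, worker, max_workers=4):
    """Apply worker to every .txt file under corpus_dir, in parallel.

    Threads are fine here because the real work (spawning graphalyzer.py
    per file) is I/O-bound; SQLite's file locking serializes the writes.
    """
    files = sorted(str(p) for p in pathlib.Path(corpus_dir).rglob("*.txt"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, files))
```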
-
remove-duplicates.py -- Takes one command-line argument: the directory containing all the text files. Removes duplicate encodings of the same text, preferring ASCII over ISO and ISO over UTF-8.
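That preference order can be sketched as below. It assumes the Project Gutenberg filename convention where "12345.txt" is ASCII, "12345-8.txt" is the ISO (8-bit) edition, and "12345-0.txt" is UTF-8; the actual script's logic may differ:

```python
import re
from collections import defaultdict

# Assumed suffix convention: "" = ASCII, "-8" = ISO, "-0" = UTF-8.
# Lower rank is preferred.
RANK = {"": 0, "-8": 1, "-0": 2}
NAME_RE = re.compile(r"^(\d+)(-[08])?\.txt$")

def duplicates_to_remove(filenames):
    """Return the less-preferred encoding variants among filenames."""
    groups = defaultdict(list)
    for name in filenames:
        m = NAME_RE.match(name)
        if m:
            groups[m.group(1)].append(name)
    doomed = []
    for names in groups.values():
        names.sort(key=lambda n: RANK[NAME_RE.match(n).group(2) or ""])
        doomed.extend(names[1:])  # keep only the best-ranked copy
    return doomed
```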
-
result-analysis.r -- Generates a set of graphs from SQL queries against the results. Can be used as a guideline for future data exploration in R. Can easily be run from the command line with 'R CMD BATCH result-analysis.r'.
All research results, presentations, and documentation are licensed under Creative Commons Attribution-NonCommercial 3.0. All source code is licensed under the GPL v3.0.
GutenbergGraphalyzer by Nathaniel Husted is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.