Standardized Project Gutenberg Corpus Tutorial

This repository contains example notebooks that illustrate how to use the Standardized Project Gutenberg Corpus (SPGC) and how to reproduce the analysis presented in the manuscript:

A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics
M. Gerlach, F. Font-Clos, arXiv:1812.08092, Dec 2018

The data is not included in this repository, but you can get it in two ways:

Run the code yourself to get the latest version of the corpus, which will include all books available in Project Gutenberg (PG) at the time you run it.
Download the pre-processed data to get exactly the same books we used in the manuscript (those available up to July 18, 2018).

We assume that you have the following two folders at the same level in your folder hierarchy:

gutenberg/ in which you have the data.
gutenberg-analysis/ with the code in this repository.
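
As a quick sanity check of this layout, here is a minimal sketch that reports which of the two folders are missing under a common parent directory. The helper `missing_folders` is hypothetical and not part of this repository; only the folder names `gutenberg` and `gutenberg-analysis` come from the text above.

```python
from pathlib import Path

# Folder names taken from the README; the helper itself is hypothetical.
EXPECTED_FOLDERS = ("gutenberg", "gutenberg-analysis")


def missing_folders(parent):
    """Return the expected sibling folders that are absent under `parent`."""
    parent = Path(parent)
    return [name for name in EXPECTED_FOLDERS if not (parent / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from inside gutenberg-analysis/,
    # so the shared parent directory is one level up.
    missing = missing_folders(Path.cwd().parent)
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Folder layout looks as expected.")
```

If the list comes back empty, both folders sit side by side as assumed.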

Example notebooks showing how to access and analyze the data are in notebooks_tutorial/.

Tutorial 01: Loading a book and metadata queries has basic examples of how to load a single book and how to query the metadata to get a selection of books, e.g. all books by the same author.
More tutorials will be added.
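
To give a flavor of the kind of metadata query Tutorial 01 covers, here is a minimal, self-contained sketch that selects books by author from a metadata table. It is not the code used in the notebooks: the function `books_by_author`, the column names (`id`, `title`, `author`, `language`), and the sample rows are placeholders made up for illustration, and the real metadata file may be organized differently.

```python
import csv
import io


def books_by_author(rows, name):
    """Return metadata rows whose 'author' field contains `name` (case-insensitive)."""
    needle = name.lower()
    return [row for row in rows if needle in (row.get("author") or "").lower()]


# Tiny in-memory stand-in for a metadata table.
# Column names and rows are made-up placeholders, not real corpus entries.
SAMPLE = """id,title,author,language
PG1,Example Title A,"Doe, Jane",en
PG2,Example Title B,"Roe, Richard",en
PG3,Example Title C,"Doe, Jane",fr
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selection = books_by_author(rows, "doe")
print([row["id"] for row in selection])
```

With a real metadata file, the in-memory string would be replaced by an open file handle passed to `csv.DictReader`, assuming the file is a CSV with an author column.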

The notebooks we used to create the figures in our manuscript are in notebooks_manuscript/.
