This corpus is a subset, in XML-TEI format, of the poems available in the poetry corpus of the Biblioteca Italiana.
It contains 25341 works by 214 authors, with the following totals:
- Authors: 214
- Works: 25341
- Verses: 1070717
- Words: 7121246
- Characters: 38095393
The file biblitaliana.zip contains the compressed JSON corpus. The format of each entry is as follows:
{
    "url": "https://github.com/linhd-postdata/biblioteca_italiana/blob/master/xml/bibit000213",
    "author": "Dante Alighieri",
    "collection": "Il Fiore",
    "title": "I",
    "manually_checked": false,
    "text": [
        [
            {
                "verse": "Lo Dio d'Amor con su' arco mi trasse",
                "words": [
                    "Lo",
                    "Dio",
                    "d'Amor",
                    "con",
                    "su'",
                    "arco",
                    "mi",
                    "trasse"
                ]
            },
            ...
        ],
        ...
    ]
},
...
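As a rough illustration of how an entry in this format could be traversed, here is a minimal Python sketch. It works on a single in-memory entry copied from the example above, trimmed to one verse. The helper names iter_verses and entry_stats are hypothetical and not part of this repository, and the sketch assumes that "text" is a list of verse groups (possibly stanzas), each group being a list of objects with "verse" and "words" keys. How the entries are arranged inside biblitaliana.zip is not assumed here.

```python
import json

# One entry in the format shown above, trimmed to a single verse for brevity.
sample_entry = json.loads("""
{
  "url": "https://github.com/linhd-postdata/biblioteca_italiana/blob/master/xml/bibit000213",
  "author": "Dante Alighieri",
  "collection": "Il Fiore",
  "title": "I",
  "manually_checked": false,
  "text": [
    [
      {
        "verse": "Lo Dio d'Amor con su' arco mi trasse",
        "words": ["Lo", "Dio", "d'Amor", "con", "su'", "arco", "mi", "trasse"]
      }
    ]
  ]
}
""")


def iter_verses(entry):
    """Yield each verse object, walking the nested "text" list.

    Assumption: the outer list holds groups of verses and each group
    is a list of {"verse": ..., "words": [...]} objects.
    """
    for group in entry["text"]:
        for verse in group:
            yield verse


def entry_stats(entry):
    """Count the verses and words of a single entry (hypothetical helper)."""
    verses = list(iter_verses(entry))
    return {
        "verses": len(verses),
        "words": sum(len(v["words"]) for v in verses),
    }


if __name__ == "__main__":
    for verse in iter_verses(sample_entry):
        print(verse["verse"])
    print(entry_stats(sample_entry))
```

Summing such per-entry counts over all entries would be one way to recompute corpus-level totals, although the exact counting rules used for the figures listed in this README are not documented here.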
The json folder contains the works grouped by author, and the xml folder contains the XML-TEI version of the texts.
The script biblioteca_italiana.py was used to build the JSON files.