State of the European Union in machine-readable format.

pournaki/soteu-dataset

SOtEU Dataset

This repository contains the State of the European Union speeches in machine-readable (csv) and human-readable (pdf) formats. Download the latest release here.

Data collection

The raw data is collected manually, speech by speech, from https://state-of-the-union.ec.europa.eu/ and pasted into individual text files in ./txt/raw/, along with the URL and date of access.

Processing

The preambles (name of the speaker, date…) are removed and the cleaned files are saved in ./txt/nopreamble/. All speeches are then combined into a single csv with the following columns, where doc_id is an integer going from 0 to N and N is the number of speeches:

doc_id  date  speaker  url  text
int     str   str      str  str
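
As a rough illustration of how the columns above might be consumed, here is a minimal sketch using only Python's standard library. The helper name `load_speeches` and the in-memory sample are assumptions made for this example; they are not part of the repository, and the placeholder values are not real data. The date format is not specified here, so the sketch leaves `date` as a plain string.

```python
import csv
import io

# Column names as listed in the table above.
EXPECTED_COLUMNS = ["doc_id", "date", "speaker", "url", "text"]


def load_speeches(handle):
    """Read speeches from an open csv file and return a list of dicts.

    Hypothetical helper, not part of the repository.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    speeches = []
    for row in reader:
        row["doc_id"] = int(row["doc_id"])  # doc_id is documented as int
        speeches.append(row)
    return speeches


# Tiny in-memory stand-in for the real file (placeholder values, not real data).
sample = io.StringIO(
    "doc_id,date,speaker,url,text\n"
    '0,<date>,<speaker>,<url>,"First line.\nSecond line."\n'
)
speeches = load_speeches(sample)
print(speeches[0]["doc_id"], speeches[0]["speaker"])
```

With the released file, the same helper could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the csv was saved. Opening with `newline=""` lets the csv module handle line breaks inside the quoted `text` field.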

This csv is created in Python using

~ python src/speech2csv.py  

For reading convenience, the speeches are also saved as a pdf using

~ python src/speech2tex.py  
~ pdflatex tex/soteu-speeches.tex
