Open Redatam (Python Package)


About

Open Redatam is software for extracting raw information from REDATAM databases.

For the standalone C++ command line application and desktop app, see the main directory of this repository.

Install the Python package using a virtual environment:

git clone https://github.com/pachadotdev/open-redatam.git
cd open-redatam/pypkg

python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install pandas numpy pybind11
pip install --use-pep517 .

As a developer, be sure to delete the previous build after making changes and before reinstalling:

rm -rf build dist redatam.egg-info
pip install --use-pep517 .

As an optional step, you can run the tests:

python tests/basic-test.py

Processed data

If you only need the processed data, you can download the microdata repository. It is available in RDS format for easy loading into R.

Available datasets:

  • Argentina: 1991, 2001, 2010
  • Bolivia: 2001, 2012
  • Chile: 2017
  • Ecuador: 2010
  • El Salvador: 2007
  • Guatemala: 2018
  • Mexico: 2000

Requirements

Python 3.8 or higher.

Usage

For a given census, such as the Chilean Census 2017, pass the path of its dictionary file to read_redatam:

import redatam
redatam.read_redatam("input-dir/dictionary.dicx")
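When the exact name of the dictionary file is not known in advance, it can be looked up before calling read_redatam. The following is a minimal sketch, not part of the package: find_dictionary is a hypothetical helper, and it assumes the census folder holds a dictionary with a .dicx or .dic extension, preferring .dicx when both exist.

```python
from pathlib import Path


def find_dictionary(input_dir):
    """Return the first REDATAM dictionary file found in input_dir.

    Hypothetical helper, not part of the redatam package. It assumes the
    dictionary has a .dicx or .dic extension and prefers .dicx.
    """
    folder = Path(input_dir)
    for suffix in (".dicx", ".dic"):
        # Compare suffixes case-insensitively; sort for a stable choice
        matches = sorted(
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() == suffix
        )
        if matches:
            return matches[0]
    raise FileNotFoundError(f"No .dic or .dicx file found in {folder}")


# Usage, assuming the package is installed and "input-dir" holds a census:
# import redatam
# tables = redatam.read_redatam(str(find_dictionary("input-dir")))
```

The usage lines are left as comments so the helper can be tried without a REDATAM database at hand.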

Please read the vignette for a more detailed explanation of how this package can be used in conjunction with dplyr and other packages.

Differences with the C++ standalone application

The Python package uses a modified copy of the C++ code that reads the REDATAM databases. Instead of writing to CSV files, it parses the data into a dictionary of data frames.
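If CSV files are still wanted, the dictionary of data frames can be written out by hand. This is a minimal sketch under stated assumptions: write_tables_to_csv is a hypothetical helper, the values are assumed to be pandas DataFrames (anything with a to_csv method works), and the dictionary keys are assumed to be usable as file names.

```python
from pathlib import Path


def write_tables_to_csv(tables, output_dir):
    """Write each data frame in `tables` to <output_dir>/<name>.csv.

    Hypothetical helper, not part of the redatam package. `tables` is
    assumed to map table names to pandas DataFrames.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for name, frame in tables.items():
        path = out / f"{name}.csv"
        frame.to_csv(path, index=False)  # omit the row index column
        written.append(path)
    return written


# Usage, assuming the package is installed:
# import redatam
# tables = redatam.read_redatam("input-dir/dictionary.dicx")
# write_tables_to_csv(tables, "output-dir")
```

The helper returns the list of paths it wrote, which makes it easy to check the result or pass the files on to another tool.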

Credits

Open Redatam was created and is supported by Lital Barkai ([email protected]).

The tests, installation instructions and Python package were created by Mauricio "Pacha" Vargas Sepulveda ([email protected]).

The original converter was created by Pablo De Grande. See here for more information.

This project uses pugixml created by Arseny Kapoulkine to structure a part of the output data.