Skip to content

badmonster-nc/stackoverflow_in_pg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Requirement

install postgres

On ubuntu, install postgres on ubuntu
On centos, install postgres on centos

Import Stackoverflow database in pg

Original source: http://www.bortzmeyer.org/stackoverflow-to-postgresql.html

The social network Stackoverflow (https://stackoverflow.com/) regularly publishes a dump of its database under a Creative Commons free licence. We can find dump file here: Main download link: https://archive.org/download/stackexchange File download link: https://archive.org/download/stackexchange/stackoverflow.com-Badges.7z https://archive.org/download/stackexchange/stackoverflow.com-Comments.7z https://archive.org/download/stackexchange/stackoverflow.com-PostHistory.7z https://archive.org/download/stackexchange/stackoverflow.com-PostLinks.7z https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z https://archive.org/download/stackexchange/stackoverflow.com-Tags.7z https://archive.org/download/stackexchange/stackoverflow.com-Users.7z https://archive.org/download/stackexchange/stackoverflow.com-Votes.7z

Extract the XML file of each downloaded file.

At the time of writing this document, the dump files are from June 2017 Each XML file store a class of Stack Overflow objects:

Object 7zip file size XML file size SQL file size (parsed xml) Lines in pg table Pg table size
Badges 166 MB 2.51 GB 1.2 GB 22 997 200 lines 1 658 MB
Comments 3.16 GB 14.2 GB 10.6 GB 58 187 400 lines 11 GB
PostHistory 18.2 GB 88.9 GB 66.4 GB 93 512 900 lines 54 GB
PostLinks 57 MB 492 MB 215 MB 4 214 710 lines 242 MB
Posts 10.4 GB 52.3 GB 38.1 GB 36 149 100 lines 31 GB
Tags 704 KB 4.18 MB 1.67 MB 49 306 lines 2 808 kB
Users 284 MB 2.07 GB 753 MB 7 246 590 lines 773 MB
Votes 757 MB 11.7 GB 5.23 GB 128 369 000 lines 5 422 MB

Transform xml file to sql file

The python_src directory contains a python file per xml file to parse To launch the parser

python so2pg-<xml_to_parse>.py <where/my/xml/are> > <path_to_result>/<parsed_xml>.sql

If you want test it, in python directory there is a subdirectory named sample_xml.

#change to directory python_src
#launch 
python so2pg-badges.py ./sample_xml > badges.sql

#you obtain a sql file ready to load to postgres

Load sql file in pg database

# connect to user postgres
sudo su - postgres
# create your database
createdb --encoding=UTF-8 so
#create database table in database so
psql -U postgres -d so -f so-create.sql 

All you have to do now is to load the sql files. You can load them in the order you want. I disable all integrity constraints. Move to the directory where the sql files are located and start the loads like this

psql -U postgres -d so -f badges.sql
psql -U postgres -d so -f comments.sql 
psql -U postgres -d so -f posthistory.sql 
psql -U postgres -d so -f postlinks.sql 
psql -U postgres -d so -f posts.sql 
psql -U postgres -d so -f tags.sql 
psql -U postgres -d so -f users.sql 
psql -U postgres -d so -f votes.sql  

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages