
Commit

remove pgloader, load directly into postgres using ./reparse.sh
talos committed Aug 13, 2015
1 parent 40a2ac6 commit 48995c8
Showing 5 changed files with 64 additions and 81 deletions.
9 changes: 4 additions & 5 deletions README.md
@@ -113,13 +113,12 @@ building and tax period the data applies to.
 
 ### To import the CSV into postgres
 
-*In progress*
+You should have [docker4data](http://dockerfordata.com) installed and set up on
+your system.
 
-There are a few complicated dependencies here, including
-[pgloader](http://pgloader.io), and a few external tables (PLUTO and the DHCR
-stabilization building list history.)
+./reparse.sh
 
-./import.sh
+This will directly parse the `data` folder into docker4data's postgres.
 
 ## Data Usage
5 changes: 0 additions & 5 deletions import.sh
@@ -1,14 +1,9 @@
 #!/bin/bash -e
 
-# createdb stabilization 2>/dev/null || :
-
 export PGPASSWORD=docker4data
 export PGUSER=postgres
 export PGHOST=localhost
 export PGPORT=54321
 export PGDATABASE=postgres
 
-# need to have pgloader installed
-pgloader pgloader.load
-
 psql -f cross-tab-rs-counts.sql
19 changes: 13 additions & 6 deletions parse.py
@@ -30,6 +30,8 @@
     "apts"
 ]
 
+ROW_BUFFER = 10000
+
 BILL_PDF, STATEMENT_PDF, STATEMENT_HTML, NOPV_PDF, NOPV_HTML = (
     'Quarterly Property Tax Bill.pdf', 'Quarterly Statement of Account.pdf',
     'Quarterly Statement of Account.html', 'Notice of Property Value.pdf',
@@ -486,6 +488,7 @@ def main(root): #pylint: disable=too-many-locals,too-many-branches,too-many-stat
     """
     writer = csv.DictWriter(sys.stdout, HEADERS)
     writer.writeheader()
+    rows_to_write = []
     for path, _, files in os.walk(root):
         bbl_json = []
         for filename in sorted(files):
@@ -514,12 +517,10 @@ def main(root): #pylint: disable=too-many-locals,too-many-branches,too-many-stat
                     file_data = handle.read()
                 activity_through = parsedate(filename.split(' - ')[0])
                 for data in handler(file_data):
-                    base = {
-                        'bbl': ''.join(bbl_array),
-                        'activityThrough': activity_through
-                    }
-                    base.update(data)
-                    writer.writerow(base)
+                    data['bbl'] = ''.join(bbl_array)
+                    data['activityThrough'] = activity_through
+                    #writer.writerow(base)
+                    rows_to_write.append(data)
                     bbl_json.append(data)
 
             except Exception as err: # pylint: disable=broad-except
@@ -528,5 +529,11 @@ def main(root): #pylint: disable=too-many-locals,too-many-branches,too-many-stat
         with open(os.path.join(path, 'data.json'), 'w') as json_outfile:
             json.dump(bbl_json, json_outfile)
 
+        if len(rows_to_write) >= ROW_BUFFER:
+            writer.writerows(rows_to_write)
+            rows_to_write = []
+    writer.writerows(rows_to_write)
+
 
 if __name__ == '__main__':
     main(sys.argv[1])
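The buffered-write pattern this diff introduces can be sketched in isolation (a minimal illustration with invented sample rows; `ROW_BUFFER` is lowered from the commit's 10000 so the flush path is actually exercised):

```python
import csv
import io

ROW_BUFFER = 3  # the commit uses 10000; lowered here so batching is visible

def write_buffered(rows, out, headers):
    """Buffer dict rows and flush them in batches with writerows,
    instead of calling writerow once per row."""
    writer = csv.DictWriter(out, headers)
    writer.writeheader()
    buffered = []
    for row in rows:
        buffered.append(row)
        if len(buffered) >= ROW_BUFFER:
            writer.writerows(buffered)
            buffered = []
    writer.writerows(buffered)  # flush the final, partial batch

out = io.StringIO()
sample = [{'bbl': i, 'key': 'tax'} for i in range(5)]
write_buffered(sample, out, ['bbl', 'key'])
print(out.getvalue(), end='')
```

Batching cuts per-row call overhead when streaming hundreds of thousands of rows to stdout, which matters once the output is piped straight into `psql`.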
64 changes: 0 additions & 64 deletions pgloader.load

This file was deleted.

48 changes: 47 additions & 1 deletion reparse.sh
@@ -1,4 +1,50 @@
 #!/bin/bash
 
 source .env/bin/activate
-time python parse.py data/ >data/rawdata.csv 2>data/rawdata.log &
+
+export PGPASSWORD=docker4data
+export PGUSER=postgres
+export PGHOST=localhost
+export PGPORT=54321
+export PGDATABASE=postgres
+
+psql -c 'drop table if exists rawdata cascade;'
+psql -c 'create table rawdata (
+bbl bigint,
+activityThrough DATE,
+section TEXT,
+key TEXT,
+dueDate DATE,
+activityDate DATE,
+value TEXT,
+meta TEXT,
+apts TEXT
+);'
+psql -c 'drop table if exists rgb cascade;'
+psql -c 'create table rgb (
+source VARCHAR,
+borough SMALLINT,
+year INT,
+add_421a INT,
+add_421g INT,
+add_420c INT,
+add_j51 INT,
+add_ML_buyout INT,
+add_loft INT,
+add_former_control REAL,
+sub_high_rent_income INT,
+sub_high_rent_vacancy INT,
+sub_coop_condo_conversion INT,
+sub_421a_expiration INT,
+sub_j51_expiration INT,
+sub_substantial_rehab INT,
+sub_commercial_prof_conversion INT,
+sub_other INT,
+total_sub INT,
+total_add REAL,
+inflated VARCHAR,
+net REAL
+);'
+time cat data/rgb.csv | psql -c "COPY rgb FROM stdin WITH CSV HEADER NULL '' QUOTE '\"';"
+
+time python parse.py data/ 2>data/rawdata.log | psql -c "COPY rawdata FROM stdin WITH CSV HEADER NULL '' QUOTE '\"';"

2 comments on commit 48995c8

@jqnatividad

@talos, wondering what the motivation behind dropping pgloader is...
FYI, there's talk of using it for CKAN: ckan/ideas#150

@talos (Owner, Author) commented on 48995c8 Aug 14, 2015

@jqnatividad pgloader is great, but it's unnecessary if the CSV being imported is of perfect quality. In this case, I'm generating the CSV in Python and can ensure it's high quality, so I may as well use postgres's COPY directly.

If you're feeding in large CSVs from external sources, pgloader is great. Here, it's an unnecessary dependency.
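The "perfect quality" guarantee comes largely from Python's `csv` module: its default quoting follows the RFC 4180 conventions that `COPY ... WITH CSV QUOTE '"'` expects, so nothing needs cleaning in between. A minimal sketch (the column names are illustrative, not from the repo):

```python
import csv
import io

# Embedded commas and quotes are escaped by csv.writer in the
# RFC 4180 style that postgres's COPY ... WITH CSV understands,
# so output piped straight into psql needs no extra cleaning.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['bbl', 'value'])
writer.writerow(['1000010001', 'a value with, a comma and a " quote'])
print(buf.getvalue(), end='')
```

The tricky field comes out as `"a value with, a comma and a "" quote"`, which `COPY` parses back to the original string.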
