
Cookbook

Please feel free to add recipes to the Cookbook!

CKAN recipes using qsv, ckanapi, jq and xargs

  • get a CSV of datasets/users/groups/orgs in a CKAN instance
ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
  • get a CSV of resources for a given dataset
ckanapi -r https://catalog.data.gov action package_show \
id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
| jq -c '.resources[]' \
| qsv jsonl \
> resources.csv
  • download a dataset's CSV resource and compute summary statistics
ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
> wellstar-oil-and-gas-wells.json
cat wellstar-oil-and-gas-wells.json \
| jq -c '.resources[] | select(.name=="CSV") | .url' \
| xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv

Date Enrichment

  • compute the quarter from "Created Date". Store it in a new column named "Quarter"
    Ingredient:
qsv luajit map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv

Note: Change the column name on line 7 of getquarter.lua to adapt it to your data. You also need the date.lua module in the same directory.
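
For reference, here is a minimal sketch of what getquarter.lua might look like. This is not the actual script: the column name, the col table accessor, and the "YYYY-Qn" output format are assumptions to adapt to your qsv version and data.

-- getquarter.lua (sketch): derive the calendar quarter from a date column
-- assumes the date.lua module (https://github.com/Tieske/date) is in the
-- same directory and that qsv exposes the current record as the col table
local date = require "date"

-- change this column name to match your CSV
local d = date(col["Created Date"])

-- e.g. "2015-Q3"
return string.format("%d-Q%d", d:getyear(), math.ceil(d:getmonth() / 3))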

  • partition the 311 data by Quarter, creating one file per quarter in the nyc311byqtr directory
qsv partition Quarter nyc311byqtr result-qtr.csv
  • convert "Created Date" to ISO-8601 format
qsv apply datefmt "Created Date" nyc311samp.csv > result-iso8601.csv
  • format "Created Date" using the format string "%a %b %e %T %Y %z"
qsv apply datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv
  • create a "Created Year" column from "Created Date"
qsv apply datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv \
> result-year.csv
  • compute Turnaround Time. Store it in a new column named "TAT"
    Ingredient:
qsv luajit map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv
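
As with getquarter.lua above, here is a hedged sketch of what turnaroundtime.lua could compute; the column names and the fractional-days output are assumptions:

-- turnaroundtime.lua (sketch): days elapsed between creation and closure
-- assumes date.lua is in the same directory and that the 311 export has
-- "Created Date" and "Closed Date" columns
local date = require "date"

local created = date(col["Created Date"])
local closed  = date(col["Closed Date"])

-- date.diff returns a date span; spandays() converts it to fractional days
return date.diff(closed, created):spandays()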
  • What is the average turnaround time in Brooklyn for 311 calls?
qsv search --select City --ignore-case brooklyn nyc311samp.csv \
| qsv luajit map TAT -x -f turnaroundtime.lua \
| tee brooklyn-311-details.csv \
| qsv stats --everything > result-brooklyn311-stats.csv

NOTE: The tee command reads from stdin and writes to both stdout and one or more files at the same time. We do this so we can create the brooklyn-311-details.csv file and pipe the same data to the qsv stats command.

Geocoding

  • using the "Location" column, geocode the nearest city in a new column named "City"
qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv

Note: The bundled static geocoder uses the reverse-geocoder crate, which looks up the closest city in the GeoNames cities database.

  • geocode the county in a new column
qsv apply geocode Location --new-column County --formatstr county nyc311samp.csv > result-county.csv

Multi-table join avoiding repeated columns

This example was inspired by having to combine multiple tables exported from another system, which were themselves the product of multiple database joins. Suppose you have several tables (table_*.csv) that share the same first 10 columns and then have a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
    qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
    mv new.csv combined.csv
done

We use a loop to perform multiple joins. Each time, qsv select pulls out the join key (column 2) and the columns unique to that file (column 11 onwards); the same selection could be done with cut -d "," -f 2,11- $NEXT if preferred. The join column becomes column 1 of the intermediate file.

The proposed qsv join --merge option would avoid duplicating the join column.
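
Until then, one workaround is qsv select's '!' (invert) prefix. This is a sketch, assuming the join column is named id (a hypothetical name) and that your qsv build lets you address the second occurrence of a duplicated header as id[1]:

qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) | qsv select '!id[1]' > new.csv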

Little versus Big Endian Spreadsheets

As discussed in issue #609, the CPAN module Spreadsheet::WriteExcel does not generate the usual little-endian XLS file that Excel produces. Other tools may do the same.

This causes issues for qsv search, among other commands.

The quick workaround is to pipe the data through tr -d '\000' to strip the NUL bytes.
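
For example, assuming the exported file has been saved as export.csv and we are searching a Name column (both hypothetical names):

tr -d '\000' < export.csv | qsv search --select Name smith > matches.csv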

For Perl scripts, the definitive fix is to use Excel::Writer::XLSX from the same author, which is plug-in compatible and requires very little modification of the Perl code (a simple search and replace).

qsv cat rows of many CSV files with varying columns

CSVs from a third party can be unreliable: over time the header formatting may change, or a header may be added or removed. The following script concatenates hundreds of CSV files that share the same basic set of headers, with marginal differences.

e.g.

Input CSV #1

one,two
hello,world

Input CSV #2

ONE,THREE
more,data

Output CSV

one,two,three
hello,world,
more,,data

#!/bin/bash

# bash scripts/fix.sh ./fixed "*.csv"

DIST=$1
TARGETS=$2

mkdir -p $DIST

# Step 1: collect the unique header names by parsing all the csv files
unique_headers_list=$(qsv headers -j --intersect $TARGETS | tr '[:upper:]' '[:lower:]' | sort -u)
unique_headers_comma=$(echo $unique_headers_list | tr ' ' ',')

# Step 2: For each CSV file, add missing headers with default values
FILES=($TARGETS)
for file in "${FILES[@]}"; do
    FILENAME=${file##*/}
    OUT="$DIST/$FILENAME"

    # make headers lowercase
    LOWERCASE=$OUT.lowercase
    qsv input $file | awk 'NR==1 {print tolower($0)} NR>1 {print}' > $LOWERCASE
    mv $LOWERCASE $OUT

    # find the missing headers, if any
    MISSING_HEADERS=$(echo -e "$(qsv headers -j $OUT)\n$unique_headers_list" | sort | uniq -u | tr '\n' ',' | sed 's/,$//' | sed 's/^,//')

    # add missing headers with empty values
    FULL=$OUT.full
    echo $MISSING_HEADERS | qsv cat columns -o $FULL -p $OUT -
    mv $FULL $OUT

    # the order of headers cannot be assumed, so they must be selected by name
    ORDERED=$OUT.ordered
    qsv select $unique_headers_comma -o $ORDERED $OUT
    mv $ORDERED $OUT
done