Advanced Data Processing

Here we describe how to run the data processing steps from the Wiki homepage dynamically, across any number of files in a folder.

Core DataPrep

First, make sure you are able to run a single file as described on the Wiki homepage. That covers the main work of processing a single file, but none of the housekeeping needed for reliable, trackable multi-file processing.
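
As a quick sanity check, a single-file run looks roughly like the following. The variables are the same ones used by the processor script below; the pixel file path and output path here are examples only, and the homepage has the authoritative list:

export OSRM_FILE="/home/$USER/osrm-data/california-latest.osrm"
export HGT_FILES="/home/$USER/hgt/"
export HGT_USER="postit"
export HGT_PASS="PASSWORD"
export PIXEL_FILE="/home/$USER/pixels/sorted/example.csv"
export TREATED_OUT_FILE="/home/$USER/pixels/example-processed.csv"
node ./cec-dataprep/index.js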

High-level Workflow Overview

Our workflow for multi-file processing looks like this:

  1. Put all of the files we want to process in a folder somewhere in our home directory (a setup sketch follows this list)

  2. Create a "runner" script which loops through each file and calls sbatch to queue a "processor" script for it

  3. Create a "processor" script which copies the file to a "scratch" working directory, runs the core dataprep process on that file, and then moves the completed file to a "results" directory
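
For step 1, a minimal setup looks like this (the ~/pixels paths match the scripts below; adjust them if your layout differs):

# create the input and results directories used by the scripts below
mkdir -p ~/pixels/sorted ~/pixels/results

# move the CSV files you want processed into the input folder
mv /path/to/your/*.csv ~/pixels/sorted/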

Runner script

Create a script called dataprep-runner.sh which looks like this:

#!/bin/bash

# Get all CSV files in a folder and submit an sbatch job for each.
# PIXEL_FILE is exported so the processor script can read it; sbatch
# propagates the submitting shell's environment to the job by default.

FILES=~/pixels/sorted/*.csv

for f in $FILES
do
	# the processor script handles copying the file into scratch
	export PIXEL_FILE=$f
	FILE_NAME=$(basename -a -s .csv "$f")
	echo "Batching dataprep run for $f"
	sbatch -t 120 -J "dp-$FILE_NAME" -o "batch-$FILE_NAME-%j.out" dataprep-processor.sh
done
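
Once jobs are submitted, you can watch them with standard SLURM tooling, for example:

# list your queued and running jobs
squeue -u $USER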

Processor script

Create a file called dataprep-processor.sh like so:

#!/bin/bash -l

# PIXEL_FILE is passed in from the runner script via the environment
FILE_NAME=$(basename -a -s .csv "$PIXEL_FILE")

# Print the hostname for debugging purposes
hostname

# Set your variables
export OSRM_FILE="/home/$USER/osrm-data/california-latest.osrm"
export HGT_FILES="/home/$USER/hgt/"
export HGT_USER="postit"
export HGT_PASS="PASSWORD"

# copy the file into a per-job scratch working directory
mkdir -p /scratch/$USER/$SLURM_JOBID

cp "$PIXEL_FILE" "/scratch/$USER/$SLURM_JOBID/$FILE_NAME.csv"
PIXEL_FILE="/scratch/$USER/$SLURM_JOBID/$FILE_NAME.csv"

export TREATED_OUT_FILE="/scratch/$USER/$SLURM_JOBID/$FILE_NAME-processed.csv"

echo "Processing $PIXEL_FILE to $TREATED_OUT_FILE"
srun node ./cec-dataprep/index.js

# remove the copied pixel file and move the result to your home directory
echo "Finished processing, cleaning up files"
mkdir -p "/home/$USER/pixels/results"
rm "$PIXEL_FILE"
mv "$TREATED_OUT_FILE" "/home/$USER/pixels/results/"

Now you can run the runner with bash dataprep-runner.sh and it will submit one job for each CSV file.
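
When the jobs finish, the processed files land in the results directory, and each job's log is in the batch-*.out file named by the -o pattern in the runner (written to the directory you submitted from):

# check for completed output files
ls ~/pixels/results/

# review the job logs
tail batch-*-*.out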
