Simple ETLLib Tutorial
Welcome to a short guide on how to install, configure, and use ETLLib. For the purposes of this tutorial, we will assume you have a single CSV file with tens of thousands of rows. You can use ETLlib to go from that single large CSV file to many individual JSON files that you can then use to compute tika-similarity on.
So the first step is to take your CSV and turn it into a TSV. We'll use a Python solution for this, which works the same whether you are on Linux/Mac or Windows. Note that for this step you will need Python 2.7. You can use pyenv to get a Python 2.7 version (I personally used 2.7.18). Pyenv works on both Mac and *nix systems. See here.
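For example, with pyenv the setup looks roughly like this (2.7.18 is just the version I used; yours may differ):
pyenv install 2.7.18
pyenv local 2.7.18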
The expanded tutorial with concise explanations and screenshots is here.
- Install CSVKit:
pip install csvkit==0.9.2
Assuming your dataset has been split into many CSV files of 10k rows each, you can use the following command on each 10k part; you will want to run it over your whole dataset of hundreds of thousands of CSV rows. Here's the command to quickly generate a TSV from one source CSV of 10k rows (assume it's called 10000 data.csv):
csvformat -T 10000\ data.csv > 10000\ data.tsv
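If you'd rather do this step without CSVKit, here is a minimal pure-Python sketch of the same conversion (the file names match the command above):

import csv

# Read the source CSV and re-emit it as tab-separated values.
# Binary mode is what Python 2.7's csv module expects.
with open("10000 data.csv", "rb") as src, open("10000 data.tsv", "wb") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for row in csv.reader(src):
        writer.writerow(row)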
OK, now we have the TSV file. Next we need ETLLib itself. Let's grab it:
git clone [email protected]:chrismattmann/etllib.git
Reading the instructions for ETLLib, you will need libmagic installed. Since I was on a Mac, I installed it with brew:
brew install libmagic
(*nix systems will vary).
Once libmagic is installed, this command should work:
man libmagic
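If you also want to verify that libmagic is reachable from Python, here is an optional check. Note this assumes the python-magic bindings are installed; etllib's own libmagic dependency may differ:

import magic

# Optional sanity check that the libmagic bindings load and can type a buffer.
print(magic.from_buffer("hello world", mime=True))  # e.g. 'text/plain'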
OK, with libmagic installed you're ready to install etllib:
cd etllib && python setup.py install
(make sure again that you are using Python 2.7.x; as I noted, on mine I'm using 2.7.18)
ETLLib should install fine at this point, and once it is installed you have access to the commands listed on the ETLLib Home Page.
In particular, we will use 2 commands from this library: tsvtojson and repackage. The first command takes a big TSV file of objects and converts it to one big aggregate JSON file of objects. The second command splits that big aggregate JSON file up into individual JSON files.
To use tsvtojson you will need 2 configuration files. I'm going to provide them to you here. The first is encoding.conf and the second is colheaders.conf (named to match what the commands below expect). encoding.conf tells the command what supported text encodings are present in the file. colheaders.conf tells the command, for each column, what header name it should use for the JSON file field names. Here is colheaders.conf:
storyPrimaryID
storyID
userID
userPrimaryID
gender
age
title
narrative
media
accountCreateDte
interests
(this assumes an 11-column schema with those headers; this tutorial was sourced from a social media sample dataset called pixstory with this schema. Your own schema may vary.) And here is encoding.conf:
utf-8
us-asci
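If you'd rather script the creation of these two files, here is a minimal Python sketch (the conf/ directory and file names match the commands below):

import os

# Write the two conf files shown above into a conf/ directory.
os.makedirs("conf")  # raises if conf/ already exists; remove it first if so
with open("conf/colheaders.conf", "w") as f:
    f.write("\n".join([
        "storyPrimaryID", "storyID", "userID", "userPrimaryID",
        "gender", "age", "title", "narrative", "media",
        "accountCreateDte", "interests",
    ]) + "\n")
with open("conf/encoding.conf", "w") as f:
    f.write("utf-8\nus-asci\n")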
OK, so for me, I dropped those two files into a folder called conf, and then I created two data file directories: aggregate-json to hold the aggregate JSON object output from tsvtojson, and json to hold the 10k JSON files output from repackage.
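For completeness, creating those two data directories can also be scripted (a sketch; the names are just the ones used in this tutorial):

import os

# Create the two data directories if they don't already exist.
for d in ("aggregate-json", "json"):
    if not os.path.isdir(d):
        os.makedirs(d)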
So now you're ready to run the tsvtojson command on your TSV file:
tsvtojson -t 10000\ data.tsv -j aggregate-json/aggregate.json -c conf/colheaders.conf -o pixstoryposts -e conf/encoding.conf -s 0.8 -v
On my computer it output:
tsvtojson -t 10000\ data.tsv -j aggregate-json/aggregate.json -c conf/colheaders.conf -o pixstoryposts -e conf/encoding.conf -s 0.8 -v
['utf-8', 'us-asci']
['storyPrimaryID', 'storyID', 'userID', 'userPrimaryID', 'gender', 'age', 'title', 'narrative', 'media', 'accountCreateDte', 'interests']
Deduping list of structs. Count: [10001]
After dedup. Count: [10001]
Near duplicates detection.
Filtered 0 near duplicates.
After near duplicates. Count: [10000]
Writing output file: [aggregate-json/aggregate.json]
Let's break down what you are seeing. First, when I ran the command, I gave it the two conf files. Then I gave it the parameter -o pixstoryposts: when you create a big aggregate JSON file you need something to call the objects in it, and I called them pixstoryposts. You also see that I provided the -s 0.8 flag. During processing, the command can use Jaccard similarity to drop duplicates based on a similarity threshold set between 0 and 1; I told it to drop duplicates that were 0.8 or more similar based on Jaccard similarity. You'll note that it dropped 0 near duplicates. Finally, I passed the -v flag for verbose output, just to get all the printing messages.
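To make the -s threshold concrete, here is a small sketch of the Jaccard similarity idea (an illustration of the concept, not etllib's exact implementation):

# Jaccard similarity of two token sets: |A intersect B| / |A union B|.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / float(len(sa | sb))

print(jaccard("the quick brown fox", "the quick red fox"))  # 0.6

With -s 0.8, those two strings (similarity 0.6) would not be treated as near duplicates.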
You can now confirm that you have generated the aggregate-json/aggregate.json file. If you have that, you are ready to run repackage.
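A quick way to confirm it from Python (my own snippet, not part of etllib; I'm assuming tsvtojson keys the object list by the name given to -o, so verify against your own file):

import json

# Parse the aggregate file and count the objects under the -o name.
with open("aggregate-json/aggregate.json") as f:
    data = json.load(f)
print(len(data["pixstoryposts"]))  # expect 10000 for this dataset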
Here is the command to run:
repackage -j ../aggregate-json/aggregate.json -o pixstoryposts -v
So let's break down the command, which I ran from inside the directory that I want the output files to reside in. So I cd json first, and then from there ran the command.
I'm not going to paste the full output from the command, which looks like a bunch of:
Writing json file: [/Users/mattmann/src/dsci550/data/json/09716830-9698-4f51-a707-b847d2c2aa7c.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/4f05fd11-d6df-4dda-a7ce-a2c8f8bc7ccf.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/31b05c6e-51fc-41e0-982f-a85b8ee34fb6.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/ad33411f-8e87-4946-b4f8-c75f818891b0.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/807cecec-4cf0-4de6-8ef1-51a59305bcfd.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/9f7115c5-a9fe-4721-bf21-d4213cd5b19f.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/c0553aff-8ee3-4077-86af-1beb5bbb99ab.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/a5cb7398-c083-4c00-88ae-b7f601abdddc.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/98c742f6-c799-4b2f-a6b8-e604af6a3f9a.json]
Writing json file: [/Users/mattmann/src/dsci550/data/json/ae684b2b-0de0-4850-8683-df347d638c35.json]
The -j ../aggregate-json/aggregate.json flag points at the JSON file with the 10k objects in it that you want to split into 10k individual files. Then you pass -o pixstoryposts (the object name from the tsvtojson command). Then I passed the -v flag for verbosity.
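Conceptually, what repackage is doing here is roughly the following (a sketch, not etllib's actual code; it assumes the aggregate file keys the object list by the -o name, as above):

import json
import uuid

# Split the aggregate JSON into one file per object, each named by a
# fresh UUID and written into the current directory (here, json/).
with open("../aggregate-json/aggregate.json") as f:
    objects = json.load(f)["pixstoryposts"]
for obj in objects:
    with open(str(uuid.uuid4()) + ".json", "w") as out:
        json.dump(obj, out)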
That's it!
You then have 10k JSON files and are ready for tika-similarity. Hope this guide was helpful. Try it out! That should take care of your issues with ETLLib. Note that I didn't have to change anything in the code, and that the code works with both Python 2.7 and Python 3.