Tutorial
Here is a simple tutorial on getting Tika Similarity working with your project. Note that this has been tested on Python 3.8.4 and Python 2.7.18 with pyenv on macOS, and also validated on a Kubuntu Focus using the same Python versions.
- The assumption is that you are using the Pixstory dataset, which has 95k rows and 11 columns. Any CSV/TSV dataset converted to JSON using the ETLLib process will do.
- First step, read this: you will want to split your 95k JSON dataset up into 100-file chunks. There are scripts there for Linux and Mac: here.
I went ahead and turned the Mac version into a script (split_files.sh). Here was the result:
(dsci550-py384) mattmann@MT-310349 splits % ls
dir_001 dir_008 dir_015 dir_022 dir_029 dir_036 dir_043 dir_050 dir_057 dir_064 dir_071 dir_078 dir_085 dir_092 dir_099
dir_002 dir_009 dir_016 dir_023 dir_030 dir_037 dir_044 dir_051 dir_058 dir_065 dir_072 dir_079 dir_086 dir_093 dir_100
dir_003 dir_010 dir_017 dir_024 dir_031 dir_038 dir_045 dir_052 dir_059 dir_066 dir_073 dir_080 dir_087 dir_094 split_files.sh
dir_004 dir_011 dir_018 dir_025 dir_032 dir_039 dir_046 dir_053 dir_060 dir_067 dir_074 dir_081 dir_088 dir_095
dir_005 dir_012 dir_019 dir_026 dir_033 dir_040 dir_047 dir_054 dir_061 dir_068 dir_075 dir_082 dir_089 dir_096
dir_006 dir_013 dir_020 dir_027 dir_034 dir_041 dir_048 dir_055 dir_062 dir_069 dir_076 dir_083 dir_090 dir_097
dir_007 dir_014 dir_021 dir_028 dir_035 dir_042 dir_049 dir_056 dir_063 dir_070 dir_077 dir_084 dir_091 dir_098
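If you would rather do the split from Python than from the shell, here is a minimal sketch. It is not the linked script: the function name `split_into_dirs` is made up for illustration, and it assumes all of your JSON files sit in one flat folder and that you want them spread across `dir_001` ... `dir_100` like the listing above. Python 3 only, since it uses `pathlib`.

```python
import shutil
from pathlib import Path


def split_into_dirs(source_dir, dest_dir, num_dirs=100):
    """Copy the *.json files in source_dir into dir_001, dir_002, ... under dest_dir.

    Hypothetical helper: files are dealt out in sorted order, in equal-sized
    chunks, so at most num_dirs directories are created.
    """
    files = sorted(Path(source_dir).glob("*.json"))
    dest = Path(dest_dir)
    # Ceiling division, so no file is left without a directory.
    per_dir = max(1, -(-len(files) // num_dirs))
    created = []
    for start in range(0, len(files), per_dir):
        chunk_dir = dest / "dir_{:03d}".format(start // per_dir + 1)
        chunk_dir.mkdir(parents=True, exist_ok=True)
        for f in files[start:start + per_dir]:
            shutil.copy2(str(f), str(chunk_dir / f.name))
        created.append(chunk_dir)
    return created
```

It copies rather than moves, so your original JSON folder stays untouched if you want to re-split with a different chunk count.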
- With that done, you will need to run
git clone http://github.com/chrismattmann/tika-similarity
Even if you did this before, grab the latest, since I've made some updates that should fix any Py2/3 issues.
- Once you've checked out tika-similarity, go into the folder and install 2 things. First, make sure you have
tika-python==2.6.0
installed, and also the latest version of editdistance. Those are the only 2 pip dependencies you will need.
- So, the first step is to run through each of the pipelines for {jaccard, edit, and cosine} similarity. NOTE: YOU DON'T NEED TO RUN THESE PIPELINES THROUGH ALL 95K FILES. The key thing to take away is to grab some subset of the files, even 100 at a time, run them through the pipeline, and then look at the visualizations and perform some analysis.
- I'll illustrate the commands to do this, and then it will become a bit repetitive, so bear with me.
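Before running anything, it may help to see what the three measures actually compute. The sketch below is a toy, pure-Python version of each one on plain token lists and strings. It is NOT the code inside the tika-similarity scripts (those work on the metadata Tika extracts from each file), and the function names are my own. It assumes Python 3.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Size of the intersection divided by size of the union, treating inputs as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution (free if equal)
        prev = cur
    return prev[-1]


def cosine(a, b):
    """Cosine of the angle between the term-count vectors of two token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

For example, `jaccard(["a", "b", "c"], ["b", "c", "d"])` gives 0.5 (two shared items out of four distinct ones), and `edit_distance("kitten", "sitting")` gives 3.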
FOR JACCARD SIMILARITY, read on:
(dsci550-py384) mattmann@MT-310349 data % python ../jaccard_similarity.py --inputDir splits/dir_001 --outCSV jaccard.csv
Accepting all MIME Types.....
(dsci550-py384) mattmann@MT-310349 data % ls
jaccard.csv splits
2.
(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-circle-packing.py --inputCSV jaccard.csv --cluster 2
[
{
"children": [
{
"name": "splits/dir_001/002289d6-adcd-4fc5-b99a-3c2b583bb4f7.json 0.6363636363636364",
"size": "0.6363636363636364"
},
... lots of output ...
"name": "cluster 2"
}
]
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json jaccard.csv splits
(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-cluster.py --inputCSV jaccard.csv --cluster 2
[
{
"children": [
... lots of output...
"name": "cluster 2"
}
]
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json clusters.json jaccard.csv splits
(dsci550-py384) mattmann@MT-310349 data % python ../generateLevelCluster.py
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json clusters.json jaccard.csv levelCluster.json splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % cp -R ../../etllib/html/* .
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html clusters.json jaccard.csv levelCluster-d3.html levelCluster.json splits
(dsci550-py384) mattmann@MT-310349 data %
NOTE: in the above, replace ../../etllib with /path/to/etllib
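If you want a quick numeric look at the clusters before opening the visualizations, something like the sketch below could work. It assumes the structure printed in the transcript above (a list of clusters, each with a "name" and a list of "children" carrying a string "size"); what actually lands in circle.json or clusters.json may be wrapped differently, so check your own files first. The function name and the sample file names a.json and b.json are made up.

```python
import json


def summarize_clusters(clusters):
    """Return {cluster name: (member count, mean size)} for a list of cluster dicts."""
    summary = {}
    for cluster in clusters:
        # "size" is printed as a string in the transcript, so convert it to float.
        sizes = [float(child["size"]) for child in cluster.get("children", [])]
        mean = sum(sizes) / len(sizes) if sizes else 0.0
        summary[cluster["name"]] = (len(sizes), mean)
    return summary


# Hand-written sample in the printed shape, not real pipeline output.
sample = json.loads("""
[
  {
    "children": [
      {"name": "splits/dir_001/a.json 0.5", "size": "0.5"},
      {"name": "splits/dir_001/b.json 1.0", "size": "1.0"}
    ],
    "name": "cluster 2"
  }
]
""")
print(summarize_clusters(sample))
```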
- Now you are ready to display your clusters from the Jaccard pipeline. To do this you will use Python's built-in simple HTTP server. If you are using Python 2, you will use:
python -mSimpleHTTPServer <port>
If you are using Python 3, you will use:
python -mhttp.server <port>
- Since I am assuming Python 3.8.4 for this guide, do this:
(dsci550-py384) mattmann@MT-310349 data % python -mhttp.server 8082
Serving HTTP on :: port 8082 (http://[::]:8082/) ...
This fires up a server on port 8082, so then visit http://localhost:8082/levelCluster-d3.html
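If you are driving everything from a script or notebook, you can start the same kind of server from Python instead of the command line. This is only a sketch, assuming Python 3.7 or later (for `ThreadingHTTPServer` and the `directory` argument); the helper name `serve_directory` is my own.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8082):
    """Serve `directory` over HTTP from a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread, so the server does not keep the interpreter alive on exit.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `server.shutdown()` followed by `server.server_close()` on the returned object when you are done looking at the page.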
Behold! OK, so now you've gone through ONE pipeline. You can use this as a template for the rest of the pipelines. What I did was try different 100-file subsets with the different pipelines (e.g., one with Jaccard, one with EDIT, and one with Cosine) and compare them. It's important to compare the same subsets across pipelines.
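Since every pipeline repeats the same four commands with a different similarity script and CSV name, the template can be written down once. Here is a minimal sketch; the function `pipeline_commands` and its parameters are my own invention, while the script names and flags are the ones used in the Jaccard run above.

```python
def pipeline_commands(similarity_script, input_dir, out_csv, clusters=2, script_dir=".."):
    """Return the argument lists for one pipeline run, in the order used above."""
    def script(name):
        return "{}/{}".format(script_dir, name)

    return [
        # 1. Compute pairwise similarity and write the CSV.
        ["python", script(similarity_script), "--inputDir", input_dir, "--outCSV", out_csv],
        # 2. Produce circle.json for the circle packing view.
        ["python", script("edit-cosine-circle-packing.py"), "--inputCSV", out_csv,
         "--cluster", str(clusters)],
        # 3. Produce clusters.json.
        ["python", script("edit-cosine-cluster.py"), "--inputCSV", out_csv,
         "--cluster", str(clusters)],
        # 4. Produce levelCluster.json.
        ["python", script("generateLevelCluster.py")],
    ]


for cmd in pipeline_commands("jaccard_similarity.py", "splits/dir_001", "jaccard.csv"):
    print(" ".join(cmd))
```

To actually run the steps rather than print them, you could pass each list to `subprocess.run(cmd, check=True)` from inside your data folder.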
OK, so how would we do the edit distance version of this? Here you go ....
The first step is to save the current JSONs, and so forth, along with the Jaccard CSV. So I create a folder to save the pipeline results.
1.
(dsci550-py384) mattmann@MT-310349 data % mkdir jaccard
(dsci550-py384) mattmann@MT-310349 data % mv *.json jaccard.csv jaccard
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html cluster-d3.html jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data % ls jaccard
circle.json clusters.json jaccard.csv levelCluster.json
(dsci550-py384) mattmann@MT-310349 data %
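You will repeat this save step after every pipeline, so here is a small Python equivalent of the mkdir and mv commands above. It is a sketch only: `save_pipeline_results` is a hypothetical helper, and it assumes Python 3 and that the results sit directly in your data folder.

```python
import shutil
from pathlib import Path


def save_pipeline_results(work_dir, name, csv_name):
    """Move every *.json file plus csv_name from work_dir into work_dir/name."""
    work = Path(work_dir)
    target = work / name
    target.mkdir(exist_ok=True)
    moved = []
    for path in sorted(work.glob("*.json")) + [work / csv_name]:
        if path.is_file():  # skip silently if the CSV was never produced
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved
```

For the Jaccard run that would be `save_pipeline_results(".", "jaccard", "jaccard.csv")`. The HTML files are left in place, just like in the listing above.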
- OK, here comes the pipeline. See if you spot the similarities.
(dsci550-py384) mattmann@MT-310349 data % python ../edit-value-similarity.py --inputDir splits/dir_001 --outCSV edit.csv
Accepting all MIME Types.....
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html cluster-d3.html edit.csv jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-circle-packing.py --inputCSV edit.csv --cluster 0
...lots of output ...
"children": [
{
"name": "splits/dir_001/00a42665-d719-48d2-9113-b6c68dc50f6b.json 0.75",
"size": "0.75"
}
],
"name": "cluster 98"
}
]
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html edit.csv jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-cluster.py --inputCSV edit.csv --cluster 2
...lots of output...
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html clusters.json edit.csv jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % python ../generateLevelCluster.py
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html clusters.json edit.csv jaccard levelCluster-d3.html levelCluster.json splits
(dsci550-py384) mattmann@MT-310349 data %
- Now view the clusters for EDIT distance!
python -mhttp.server 8082
Serving HTTP on :: port 8082 (http://[::]:8082/) ...
http://localhost:8082/levelCluster-d3.html
OK, now I will show you cosine similarity. Remember, the first thing to do is SAVE your old JSONs (and edit.csv) from the edit distance pipeline into an editdistance folder, just as you did for Jaccard.
OK, so for COSINE SIMILARITY .... here's how you do it.
1.
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html cluster-d3.html editdistance jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data % python ../cosine_similarity.py --inputDir splits/dir_001 --outCSV cosine.csv
Accepting all MIME Types.....
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html cluster-d3.html cosine.csv editdistance jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-circle-packing.py --inputCSV cosine.csv --cluster 2
...lots of output ...
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html cosine.csv editdistance jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-cluster.py --inputCSV cosine.csv --cluster 2
...lots of output...
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html clusters.json cosine.csv editdistance jaccard levelCluster-d3.html splits
(dsci550-py384) mattmann@MT-310349 data %
(dsci550-py384) mattmann@MT-310349 data % python ../generateLevelCluster.py
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json circlepacking.html cluster-d3.html clusters.json cosine.csv editdistance jaccard levelCluster-d3.html levelCluster.json splits
(dsci550-py384) mattmann@MT-310349 data %
- OK now you are ready to visualize your cosine similarity clusters!
python -mhttp.server 8082
Serving HTTP on :: port 8082 (http://[::]:8082/) ...
OK, now save your cosine pipeline by moving the cosine.csv and *.json files into a cosine directory, and you are done with all 3 pipelines. You can now rotate through different subsets of the original 95k and see how the results change.
(dsci550-py384) mattmann@MT-310349 data % mkdir cosine
(dsci550-py384) mattmann@MT-310349 data % mv *.json cosine.csv cosine
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html cluster-d3.html cosine editdistance jaccard levelCluster-d3.html splits
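To rotate through subsets without losing track of which ones you used, you could pick the chunk folders with a fixed random seed, so all 3 pipelines see the same subsets. This is just a sketch: `pick_subsets` is a hypothetical helper, the seed value is arbitrary, and it assumes the dir_NNN layout from the split step and Python 3.

```python
import random
from pathlib import Path


def pick_subsets(splits_dir, k=3, seed=550):
    """Pick k of the dir_* chunk folders; the same seed always yields the same picks."""
    dirs = sorted(
        p for p in Path(splits_dir).iterdir()
        if p.is_dir() and p.name.startswith("dir_")
    )
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(dirs, min(k, len(dirs))))
```

Change the seed when you want a fresh set of folders, and keep a note of it next to your saved jaccard, editdistance, and cosine results.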