Chris Mattmann edited this page Mar 3, 2024 · 9 revisions

Here is a simple tutorial on getting Tika Similarity working with your project. Note: this has been tested on Python 3.8.4 and Python 2.7.18 with PyEnv on macOS, and also validated on Kubuntu Focus using the same Python versions.

Pre-requisites

  1. This guide assumes you are using the Pixstory dataset, which has 95k rows and 11 columns. Any CSV/TSV dataset converted to JSON using the ETLLib process will do.

Steps

  1. First step, read this: you will want to split your 95k JSON dataset up into 100-file chunks. There are scripts there for Linux and Mac: here.

I went ahead and turned the Mac version into a script; here is the result:

(dsci550-py384) mattmann@MT-310349 splits % ls
dir_001		dir_008		dir_015		dir_022		dir_029		dir_036		dir_043		dir_050		dir_057		dir_064		dir_071		dir_078		dir_085		dir_092		dir_099
dir_002		dir_009		dir_016		dir_023		dir_030		dir_037		dir_044		dir_051		dir_058		dir_065		dir_072		dir_079		dir_086		dir_093		dir_100
dir_003		dir_010		dir_017		dir_024		dir_031		dir_038		dir_045		dir_052		dir_059		dir_066		dir_073		dir_080		dir_087		dir_094		split_files.sh
dir_004		dir_011		dir_018		dir_025		dir_032		dir_039		dir_046		dir_053		dir_060		dir_067		dir_074		dir_081		dir_088		dir_095
dir_005		dir_012		dir_019		dir_026		dir_033		dir_040		dir_047		dir_054		dir_061		dir_068		dir_075		dir_082		dir_089		dir_096
dir_006		dir_013		dir_020		dir_027		dir_034		dir_041		dir_048		dir_055		dir_062		dir_069		dir_076		dir_083		dir_090		dir_097
dir_007		dir_014		dir_021		dir_028		dir_035		dir_042		dir_049		dir_056		dir_063		dir_070		dir_077		dir_084		dir_091		dir_098
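
If you would rather do the split from Python than from a shell script, here is a minimal sketch of the same idea. It is not the split_files.sh script referenced above. The function name, the `files_per_dir` parameter, and the choice to copy rather than move are all assumptions made for illustration, and it targets Python 3 because it uses pathlib.

```python
import shutil
from pathlib import Path


def split_into_dirs(src_dir, dest_dir, files_per_dir, pattern="*.json"):
    """Copy files matching `pattern` from src_dir into numbered subfolders
    (dir_001, dir_002, ...) of dest_dir, with files_per_dir files in each.

    Hypothetical helper: the folder naming only mirrors the listing above.
    """
    if files_per_dir < 1:
        raise ValueError("files_per_dir must be at least 1")
    files = sorted(Path(src_dir).glob(pattern))
    dest = Path(dest_dir)
    created = []
    for start in range(0, len(files), files_per_dir):
        # Folder numbers start at 1 and are zero-padded to three digits.
        chunk_dir = dest / "dir_{:03d}".format(start // files_per_dir + 1)
        chunk_dir.mkdir(parents=True, exist_ok=True)
        for f in files[start:start + files_per_dir]:
            # Copy so the original JSON files are left untouched.
            shutil.copy2(str(f), str(chunk_dir / f.name))
        created.append(chunk_dir)
    return created
```

For example, `split_into_dirs("json", "splits", files_per_dir=950)` would turn roughly 95k files into about 100 folders, where `json` is a placeholder for wherever your converted files live.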
  1. With that done, you will need to git clone http://github.com/chrismattmann/tika-similarity. Even if you did this before, grab the latest, since I've made some updates that should fix any Py2/3 issues.

  2. Once you've checked out tika-similarity, go into the folder and install 2 things: make sure you have tika-python==2.6.0 installed, and also the latest version of editdistance. Those are the only 2 pip dependencies you will need.

  3. So, the first step is to run through each of the pipelines for {jaccard, edit, and cosine} similarity. NOTE: YOU DON'T NEED TO RUN THESE PIPELINES THROUGH ALL 95K FILES. The key thing to take away is to grab some subset of the files, even 100 at a time, run them through the pipeline, and then look at the visualizations and perform some analysis.

  4. I'll illustrate the commands to do this, and then it will become a bit repetitive, so bear with me.
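
Before running anything, it can help to confirm that the two pip dependencies from the steps above are actually visible to the Python you are using. This is a small sketch, assuming Python 3.8 or newer (for `importlib.metadata`) and assuming the distribution names on PyPI are `tika` and `editdistance`; adjust the names if your environment reports them differently.

```python
from importlib import metadata  # available in Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; see the note above.
for name in ("tika", "editdistance"):
    print(name, installed_version(name) or "not installed")
```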

Jaccard Similarity

FOR JACCARD SIMILARITY, read on:

Steps


(dsci550-py384) mattmann@MT-310349 data % python ../jaccard_similarity.py --inputDir splits/dir_001 --outCSV jaccard.csv
 Accepting all MIME Types..... 
 (dsci550-py384) mattmann@MT-310349 data % ls 
 jaccard.csv	splits
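
As a reminder of what the metric measures, here is a generic sketch of Jaccard similarity between two sets. It is not the code inside jaccard_similarity.py, and this guide does not show which features that script compares; the metadata key names in the usage lines are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical here (a convention choice).
        return 1.0
    return len(a & b) / float(len(union))


# Hypothetical metadata keys for two files.
keys_one = {"Content-Type", "title", "author"}
keys_two = {"Content-Type", "title", "language"}
print(jaccard(keys_one, keys_two))  # 2 shared keys out of 4 distinct keys
```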

2.

(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-circle-packing.py --inputCSV jaccard.csv --cluster 2
[
  {
    "children": [
      {
        "name": "splits/dir_001/002289d6-adcd-4fc5-b99a-3c2b583bb4f7.json 0.6363636363636364",
        "size": "0.6363636363636364"
      },
... lots of output ...
"name": "cluster 2"
  }
]
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json	jaccard.csv	splits
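
The printed output is long, so a quick summary is often easier to read. The sketch below assumes the data is a list of clusters shaped like the output above (each with a "name" and a "children" list whose entries carry a "size" string). It also accepts a top-level object that wraps that list under "children", which is an assumption about how circle.json may be stored. The function name is hypothetical.

```python
import json


def summarize_clusters(data):
    """Return {cluster name: {"members": n, "mean_score": m}} for cluster data."""
    # Accept either a bare list of clusters or a wrapper object (assumed).
    clusters = data["children"] if isinstance(data, dict) else data
    summary = {}
    for cluster in clusters:
        scores = [float(child["size"]) for child in cluster.get("children", [])]
        mean = sum(scores) / len(scores) if scores else None
        summary[cluster["name"]] = {"members": len(scores), "mean_score": mean}
    return summary


# Tiny hand-written sample in the same shape as the output above.
sample = json.loads(
    '[{"children": [{"name": "a.json 0.5", "size": "0.5"},'
    ' {"name": "b.json 1.0", "size": "1.0"}], "name": "cluster 0"}]'
)
print(summarize_clusters(sample))
```

To try it on a real run, load the file first with `json.load(open("circle.json"))` and pass the result in.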

(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-cluster.py --inputCSV jaccard.csv --cluster 2
[
  {
    "children": [
... lots of output...
"name": "cluster 2"
  }
]
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json	clusters.json	jaccard.csv	splits

(dsci550-py384) mattmann@MT-310349 data % python ../generateLevelCluster.py
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		clusters.json		jaccard.csv		levelCluster.json	splits
(dsci550-py384) mattmann@MT-310349 data %

(dsci550-py384) mattmann@MT-310349 data % cp -R ../../etllib/html/* .
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		clusters.json		jaccard.csv		levelCluster-d3.html	levelCluster.json	splits
(dsci550-py384) mattmann@MT-310349 data %

NOTE: in the above, replace ../../etllib with /path/to/etllib.

  1. Now you are ready to display your clusters from the Jaccard pipeline. To do this you will use Python's built-in simple HTTP server. If you are using Python 2, you will use:

python -mSimpleHTTPServer <port>

If you are using Python 3, you will use:

python -mhttp.server <port>

  1. Since I am assuming Python 3.8.4 for this guide, do this:
(dsci550-py384) mattmann@MT-310349 data % python -mhttp.server 8082
Serving HTTP on :: port 8082 (http://[::]:8082/) ...

(this fires up a server on port 8082, so then visit http://localhost:8082/levelCluster-d3.html )
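
If you prefer to start the server from a script or a notebook, the same standard-library module can be used directly. This is a sketch for Python 3.7 or newer; the function name is made up, and unlike the command above it binds only to 127.0.0.1, which is a deliberate choice here so the files are not exposed to the network.

```python
import functools
import http.server
import threading


def start_static_server(directory=".", port=8082):
    """Serve `directory` over HTTP on localhost in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Usage (not run automatically):
#   server = start_static_server(".", 8082)
#   ... open http://localhost:8082/levelCluster-d3.html in a browser ...
#   server.shutdown(); server.server_close()
```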

Behold! OK, so now you've gone through ONE pipeline. You can use this as a template for the rest of the pipelines. What I did was try different 100-file subsets with different pipelines (e.g., one with Jaccard, one with edit distance, and one with cosine) and compare them. It's important to run each pipeline on the same subset so that the comparison is meaningful.

Edit Distance Similarity

OK, so how would we do the edit distance version of this? Here you go ....

Steps

The first step is to save the current JSON files and the Jaccard CSV, so I create a folder to hold the Jaccard pipeline results:

1.

(dsci550-py384) mattmann@MT-310349 data % mkdir jaccard
(dsci550-py384) mattmann@MT-310349 data % mv *.json jaccard.csv jaccard
(dsci550-py384) mattmann@MT-310349 data % ls                                   
circlepacking.html	cluster-d3.html		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % ls jaccard 
circle.json		clusters.json		jaccard.csv		levelCluster.json
(dsci550-py384) mattmann@MT-310349 data % 
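
Since you will repeat this save step after every pipeline, here is a small Python 3 sketch of the same mkdir and mv sequence. The function name and arguments are hypothetical; it moves every *.json file plus the named CSV into a folder named after the pipeline and leaves the HTML files in place.

```python
import shutil
from pathlib import Path


def archive_results(work_dir, label, csv_name):
    """Move *.json files and csv_name from work_dir into work_dir/label."""
    work = Path(work_dir)
    target = work / label
    target.mkdir(exist_ok=True)
    moved = []
    # JSON outputs first (sorted for a stable order), then the CSV.
    for path in sorted(work.glob("*.json")) + [work / csv_name]:
        if path.is_file():
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved
```

For the step above, the call would be `archive_results(".", "jaccard", "jaccard.csv")`.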
  1. OK, here comes the pipeline. See if you spot the similarities:
(dsci550-py384) mattmann@MT-310349 data % python ../edit-value-similarity.py --inputDir splits/dir_001 --outCSV edit.csv
Accepting all MIME Types.....
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html	cluster-d3.html		edit.csv		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 
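
For intuition about the metric itself, here is a pure-Python sketch of Levenshtein edit distance plus one common way to turn it into a 0-to-1 similarity. It is not the code in edit-value-similarity.py, which relies on the editdistance package, and the normalization by the longer string's length is an assumption for illustration rather than the script's documented formula.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / float(longest)


print(levenshtein("kitten", "sitting"))  # 3
```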

(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-circle-packing.py --inputCSV edit.csv --cluster 0

...lots of output ...

        "children": [
            {
                "name": "splits/dir_001/00a42665-d719-48d2-9113-b6c68dc50f6b.json  0.75",
                "size": "0.75"
            }
        ],
        "name": "cluster 98"
    }
]
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		edit.csv		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 

(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-cluster.py --inputCSV edit.csv --cluster 2

...lots of output...

(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		clusters.json		edit.csv		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 

(dsci550-py384) mattmann@MT-310349 data % python ../generateLevelCluster.py 
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		clusters.json		edit.csv		jaccard			levelCluster-d3.html	levelCluster.json	splits
(dsci550-py384) mattmann@MT-310349 data % 
  1. Now view the clusters for EDIT distance!
python -mhttp.server 8082
Serving HTTP on :: port 8082 (http://[::]:8082/) ...

http://localhost:8082/levelCluster-d3.html

Cosine Similarity

OK, now I will show you the same thing for cosine similarity. Remember, the first thing to do is SAVE your old JSONs and edit.csv from the edit distance run (here, into an editdistance folder).

Steps

OK, so for COSINE SIMILARITY .... here's how you do it.

1.

(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html	cluster-d3.html		editdistance		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 

Accepting all MIME Types.....
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html	cluster-d3.html		cosine.csv		editdistance		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 
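
As with the other two metrics, a short generic sketch may help show what is being computed. This is plain cosine similarity over token counts, not the tika-similarity implementation; how that script builds its vectors is not shown in this guide, so the token lists below are purely illustrative.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-tokens count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an empty vector
    return dot / (norm_a * norm_b)


print(cosine_similarity(["title", "author"], ["title", "language"]))
```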

(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-circle-packing.py --inputCSV cosine.csv --cluster 2

...lots of output ...

(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		cosine.csv		editdistance		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 

(dsci550-py384) mattmann@MT-310349 data % python ../edit-cosine-cluster.py --inputCSV cosine.csv --cluster 2

...lots of output...

(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		clusters.json		cosine.csv		editdistance		jaccard			levelCluster-d3.html	splits
(dsci550-py384) mattmann@MT-310349 data % 

(dsci550-py384) mattmann@MT-310349 data % python ../generateLevelCluster.py 
(dsci550-py384) mattmann@MT-310349 data % ls
circle.json		circlepacking.html	cluster-d3.html		clusters.json		cosine.csv		editdistance		jaccard			levelCluster-d3.html	levelCluster.json	splits
(dsci550-py384) mattmann@MT-310349 data % 
  1. OK now you are ready to visualize your cosine similarity clusters!
python -mhttp.server 8082
Serving HTTP on :: port 8082 (http://[::]:8082/) ...

OK, now save your cosine pipeline by moving the cosine.csv and *.json files into the cosine directory, and you are done with all 3 pipelines. You can now rotate through different subsets of the original 95k and see how the results change.

(dsci550-py384) mattmann@MT-310349 data % mkdir cosine
(dsci550-py384) mattmann@MT-310349 data % mv *.json cosine.csv cosine
(dsci550-py384) mattmann@MT-310349 data % ls
circlepacking.html	cluster-d3.html		cosine			editdistance		jaccard			levelCluster-d3.html	splits
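
Because all three pipelines follow the same four commands, you may want to script them. Here is a sketch that only assembles the command lines used in this guide; the function names, the `tools_dir` default, and running from the data folder are assumptions. The similarity script name is passed in by the caller (for example jaccard_similarity.py or edit-value-similarity.py), and the `--cluster` value is whatever you choose for that run.

```python
import subprocess


def pipeline_commands(similarity_script, input_dir, csv_name,
                      cluster=2, tools_dir=".."):
    """Build the four commands of one pipeline run as argument lists."""
    return [
        ["python", f"{tools_dir}/{similarity_script}",
         "--inputDir", input_dir, "--outCSV", csv_name],
        ["python", f"{tools_dir}/edit-cosine-circle-packing.py",
         "--inputCSV", csv_name, "--cluster", str(cluster)],
        ["python", f"{tools_dir}/edit-cosine-cluster.py",
         "--inputCSV", csv_name, "--cluster", str(cluster)],
        ["python", f"{tools_dir}/generateLevelCluster.py"],
    ]


def run_pipeline(similarity_script, input_dir, csv_name, cluster=2,
                 tools_dir="..", cwd="."):
    """Run the commands in order, stopping at the first failure."""
    for cmd in pipeline_commands(similarity_script, input_dir, csv_name,
                                 cluster, tools_dir):
        subprocess.run(cmd, check=True, cwd=cwd)
```

A Jaccard run over the first split would then be `run_pipeline("jaccard_similarity.py", "splits/dir_001", "jaccard.csv")`, after which you save the results into their own folder as shown earlier.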