The project contains the code to vertically split a table given the next strategies:
- Division based on correlation to attribute of interest
- Randomized division (vanilla random)
- Randomized division with shrinking of table sizes
- Randomized division with onehot encoding for leaf table columns
- Randomized division with column overlapping over leaf tables
- Randomized division with the possibility to use features are Primary Key - Foreign Key columns
Python 3.8
- Create virtual environment
python3 -m venv env
- Activate environment
source env/bin/activate
- Install requirements
pip install -r requirements.txt
Run main.py - It automatically runs all the division strategies with the titanic dataset provided in the input folder.
import pandas as pd
from utilities.divider import Divider
# Load the data, assign index column and y column
train_csv = "input/titanic_encoded_data.csv"
train_data = pd.read_csv(train_csv, index_col='PassengerId')
y_column = 'Survived'
# Initialise the divider
dv = Divider(train_data, y_column, path='output', table_has_no_index=False)
# Apply division by correlation
dv.divide(strategy="correlation")
# Apply vanilla random division
dv.divide(strategy="random")
# Apply random division with onehot encoding
dv.divide(strategy='random', onehot=4)
# Apply random division with column overlapping
dv.divide(strategy='random', overlap_r=0.5)
# Apply random division with onehot encoding and column overlapping
dv.divide(strategy='random', onehot=4, overlap_r=0.5)
# Apply random division with table shrinking
dv.divide(strategy='shrink')
# Apply random tree division
dv.divide(strategy="random_tree", minimum_columns=4)
# Apply reverse correlation division
dv.divide(strategy="reverse_correlation", minimum_columns=4)
# Apply short reverse correlation division
dv.divide(strategy="short_reverse_correlation", number_children=3, minimum_columns=4)
# Apply all division strategies above simultaneously in 1 LOC
dv.divide_all(overlap_r=0.5, onehot=4)
During division, output will be printed in the console, containing the following:
- Table description, where every line is a separate table with unique index and specified PK column and normal columns. Level of recursion is also shown for every table division and can be used to observe the propagation of the algorithm
- After every division is complete, either True or False will be printed.
True
must always be printed if correlation, vanilla random or random with shrinking strategies are used. Otherwise, either True or False are valid outputs.
One single csv file.
If divide_all
method is used, in the specified output
folder, multiple subfolders will be created and each will contain the following files:
tables.json
- Key-value pairs of table name and PK columns. Note that the algorithm considered they column
as index column in order to ensure that it will remain part of the base table and not be recursively passed down;connections.csv
- Connection between tables - every line is in the formtable_name, FK_name, table_name2, PK_name
, which means thattable_name2
has PK columnPK_name
and can be joined withtable_name
table via FK columnFK_name
;- Multiple files named
table_X_Y.csv
, whereX
andY
are integers. Each file will contain a single table, generated from the table divider. The tables can be used together withtables.json
andconnections.csv
to apply join operations on them;
If divide
method is used, in the specified output
folder, no subfolders will be created and the aforementioned files will simply be put there.
During every vertical split, a primary and a foreign key columns are created, so that it is possible to later join the tables together:
- Division based on correlation to attribute of interest
- We pick a column of importance and calculate its correlation to the other columns. The most correlated columns to it are placed in a separate table, then the second most correlated, ... and finally the least correlated. A column of importance is picked for each of the new tables and the algorithm continuous to recursively divide the table vertically.
- Randomized division (vanilla random)
- During execution, the input table is randomly split vertically into separate tables and each of the new tables is disjoint to the others (they share no common columns). Next, each of the smaller tables is recursively divided again until we are left with tables, which contain a single PK column and a single column from the initial input table. Those we call leaf tables.
- Randomized division with shrinking of table sizes
- The division is conducted similarly to the vanilla random division with the difference that we check whether it is possible to remove duplicating entries from the recurred tables and therefore shrink their sizes. This allows us to reduce the table size and use the fact that we can rely on PK-FK relation when joining tables.
- Randomized division with onehot encoding for leaf table columns
- The division is conducted similarly to the vanilla random division with the difference that the leaf tables can have onehot encoding applied to their single column if: the number of unique values in that column is at most
onehot
.
- The division is conducted similarly to the vanilla random division with the difference that the leaf tables can have onehot encoding applied to their single column if: the number of unique values in that column is at most
- Randomized division with column overlapping over leaf tables
- The division is conducted similarly to the vanilla random division with the difference that it is possible to have multiple columns appearing in multiple of the smaller tables. When we apply the table division, we randomly sample columns from the initial table and we return
overlap_r
ratio of them to it. That means we sample with replacement. As a result, the same columns can get sampled again and appear in multiple of the child tables.
- The division is conducted similarly to the vanilla random division with the difference that it is possible to have multiple columns appearing in multiple of the smaller tables. When we apply the table division, we randomly sample columns from the initial table and we return
- Random Tree Division
- The division is conducted similarly to the vanilla random division with the exception that the algorithm will try to make use of columns with all unique values to make use of them as PK-FK columns. If none of the columns meet this criteria, then artificial PK-FK columns will be created. Note that there is also a hyperparameter
minimum_columns
, which can set the minimum size of resulting sub-tables. Note that there may still be 1 remaining table with less thanminimum_columns
columns. Furthermore, with the current implementation, the divider will not be verifiable due to the implementation of the verifier. Therefore, you may getFalse
as last output, however, the algorithm will still be running the random division correctly.
- The division is conducted similarly to the vanilla random division with the exception that the algorithm will try to make use of columns with all unique values to make use of them as PK-FK columns. If none of the columns meet this criteria, then artificial PK-FK columns will be created. Note that there is also a hyperparameter
- Reverse Correlation Division
- This division first sorts all the columns based on correlation and tries to distribute them through the subtables in the most uniform way. For example, if we have a list of columns from most correlated to least correlated
[x1, x2, x3, x4, x5, x6, x7, x8]
, then they will be split in the next level as follows:[x3, x5, x8]
,[x2, x4, x7]
, and[x1, x6]
, wherex8
,x7
, andx6
will be primary keys respectively. This strategy will continue until the sub-tables have at leastminimum_columns
number of columns. Note that there may still be 1 remaining table with less thanminimum_columns
columns. Furthermore, with the current implementation, the divider will not be verifiable due to the implementation of the verifier. Therefore, you may getFalse
as last output, however, the algorithm will still be running the random division correctly.
- This division first sorts all the columns based on correlation and tries to distribute them through the subtables in the most uniform way. For example, if we have a list of columns from most correlated to least correlated
- Short Reverse Correlation Division
- This division strategy is quite different from the aforementioned division strategies. It works as follows:
- All features are sorted based on correlation from most correlated to least correlated:
[x1, x2, x3, x4, x5, x6, x7, x8]
; - Hyperparameter
minimum_columns=2
splits the list in chunks[[x1, x2], [x3, x4], [x5, x6], [x7, x8]]
; - Hyperparameter
number_children=2
constructs the tree between the tables. For instance, table[x1, x2]
shall become root table with children[x3, x4]
and[x5, x6]
, while[x3, x4]
will become root for[x7, x8]
; - Primary keys are either assigned from existing columns or synthetic ones are created if no features have unique values;
- The assigned primary keys are used to create connections between the subtables;
- Note that both subtables
[x3, x4]
and[x5, x6]
can be joined together since they have the same PK. However,[x5, x6]
and[x7, x8]
will not be joinable, because[x5, x6]
will inherit the assigned primary key of its parent table;
- All features are sorted based on correlation from most correlated to least correlated:
- The division strategy is considered "short", since it allows subtables to join one another without the need to be directly joined to their parent table;
- This division strategy will aim to create longer join paths for highly correlated features and thus put them further away from the base table;.
- This division strategy is quite different from the aforementioned division strategies. It works as follows: