This repo contains the development and experimental codebase of AutoFeat: Transitive Feature Discovery over Join Paths.
The code can be run locally or using Docker.
- Python 3.8
- Java (for data discovery only - Valentine)
- neo4j 5.1.0 or 5.3.0
- Create a virtual environment:
python -m venv {env-name}
- Activate the environment:
source {env-name}/bin/activate
- Install the requirements:
pip install -e .
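After installation, you can check that the CLI entry point used throughout this README is available:
feature-discovery-cli --help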
LightGBM on AutoGluon gives a segmentation fault or won't run unless you install the correct libomp, as described here. Steps:
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/fb8323f2b170bd4ae97e1bac9bf3e2983af3fdb0/Formula/libomp.rb
brew uninstall libomp
brew install libomp.rb
rm libomp.rb
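To verify that the pinned libomp version is in place, an optional check with a standard Homebrew command:
brew list --versions libomp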
Working with neo4j is easier using the neo4j Desktop application.
- First, download neo4j Desktop
- Open the app
- "Add" > "Local DBMS"
- Give a name to the DBMS, add a password, and choose Version 5.1.0.
- Change the "password" in the config to the password you chose:
NEO4J_PASS = os.getenv("NEO4J_PASS", "password")
- "Start" the DBMS
- Once it has started, click "Open".
- Now you can see the neo4j browser, where you can query the database or create new ones, as we will do in the next steps.
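Note: since the config reads NEO4J_PASS from the environment (see the config line above), you can also set the password as an environment variable instead of editing the file; a minimal sketch, assuming a POSIX shell:
export NEO4J_PASS='<the password you chose>'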
- "Add" > "Local DBMS"
The Docker image already contains everything necessary for development.
- Open a terminal and go to the project root (where the docker-compose.yml is located).
- Build the necessary Docker containers (note: this step takes a while):
docker-compose up -d --build
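To confirm the containers are up after the build, an optional check with a standard docker-compose command:
docker-compose ps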
- Download our experimental datasets and put them in data/benchmark.
To ingest the data in local development, it is necessary to follow the steps from the Neo4j Desktop setup beforehand.
For Docker, the neo4j browser is available at localhost:7474. No user or password is required.
- Create database benchmark in neo4j:
  - Local development: follow the steps from the Neo4j Desktop setup beforehand.
  - Docker: go to localhost:7474 to access the neo4j browser.
- Input in the neo4j browser console:
create database benchmark
- Wait about 1 minute until the database becomes available, then switch to it:
:use benchmark
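You can check that the database has become available from the same console (standard Cypher in neo4j 5, not specific to this project):
SHOW DATABASES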
- Ingest the data:
  - (Docker) Bash into the container:
docker exec -it feature-discovery-runner /bin/bash
  - (Local development) Open a terminal and go to the project root.
  - Ingest the data using the following command:
feature-discovery-cli ingest-kfk-data
- Go to config.py and set:
NEO4J_DATABASE = 'lake'
- If Docker is running, restart it.
- Create database lake in neo4j:
  - Local development: follow the steps from the Neo4j Desktop setup beforehand.
  - Docker: go to localhost:7474 to access the neo4j browser.
- Input in the neo4j browser console:
create database lake
- Wait about 1 minute until the database becomes available, then switch to it:
:use lake
- Ingest the data (depending on how many cores you have, this step can take up to 1-2 hours):
  - (Docker) Bash into the container:
docker exec -it feature-discovery-runner /bin/bash
  - (Local development) Open a terminal and go to the project root.
  - Ingest the data using the following command (see the note after this list for running it detached):
feature-discovery-cli ingest-data --data-discovery-threshold=0.55 --discover-connections-data-lake
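Since this ingestion can run for 1-2 hours, you may prefer to run it detached so it survives the terminal session; a minimal sketch, assuming a POSIX shell (the log filename is arbitrary):
nohup feature-discovery-cli ingest-data --data-discovery-threshold=0.55 --discover-connections-data-lake > ingest.log 2>&1 &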
To run the experiments in Docker, first bash into the container:
docker exec -it feature-discovery-runner /bin/bash
feature-discovery-cli --help
will show the commands for running experiments:
run-all
Runs all experiments (ARDA + base + AutoFeat).
feature-discovery-cli run-all --help
will show you the parameters needed for running.
run-arda
Runs the ARDA experiments.
feature-discovery-cli run-arda --help
will show you the parameters needed for running:
--dataset-labels
has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.
--results-file
by default the results are saved as CSV with a predefined filename in the results folder.
Example:
feature-discovery-cli run-arda --dataset-labels steel
will run the experiments on the steel dataset and save the results in the results folder.
run-base
Runs the base experiments.
feature-discovery-cli run-base --help
will show you the parameters needed for running:
--dataset-labels
has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.
--results-file
by default the results are saved as CSV with a predefined filename.
Example:
feature-discovery-cli run-base --dataset-labels steel
will run the experiments on the steel dataset and save the results in the results folder.
run-tfd
Runs the AutoFeat experiments.
feature-discovery-cli run-tfd --help
will show you the parameters needed for running:
--dataset-labels
has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.
--results-file
by default the results are saved as CSV with a predefined filename.
--value-ratio
one of the hyper-parameters of our approach; it represents a data quality metric: the percentage of null values allowed in the datasets. Default: 0.55.
--top-k
one of the hyper-parameters of our approach; it represents the number of features to select from each dataset and the number of paths. Default: 15.
Example:
feature-discovery-cli run-tfd --dataset-labels steel
will run the experiments on the steel dataset and save the results in the results folder (an example that overrides the hyper-parameters follows below).
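To override the hyper-parameters, pass the flags documented above; the values here are illustrative, not tuned recommendations:
feature-discovery-cli run-tfd --dataset-labels steel --value-ratio 0.65 --top-k 10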
Main source for finding datasets.
- To recreate our plots, first download the results from here.
- Add the results to the results folder.
- Then open the jupyter notebook; run in the root folder of the project:
jupyter notebook
- Open the file Visualisations.ipynb.
- Run every cell.
We conducted an empirical analysis of the most popular feature selection strategies based on relevance and redundancy.
These experiments are documented at: https://github.com/delftdata/bsc_research_project_q4_2023/tree/main/autofeat_experimental_analysis
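For intuition, the general relevance-redundancy idea behind these strategies can be sketched in a few lines of Python. This is a minimal illustration assuming pandas and numeric features; the function name and the correlation-based scoring are our own illustrative choices, not the exact strategies evaluated in that analysis:

import pandas as pd
from typing import List

def relevance_redundancy_rank(df: pd.DataFrame, target: str, k: int = 5) -> List[str]:
    # Greedily pick k features with high relevance to the target and low
    # redundancy with already-selected features (mRMR-style, illustrative only).
    features = [c for c in df.columns if c != target]
    corr = df[features + [target]].corr().abs()  # absolute Pearson correlations
    relevance = corr[target]
    selected: List[str] = []
    while features and len(selected) < k:
        def score(f: str) -> float:
            # relevance minus average correlation with already-selected features
            redundancy = corr.loc[f, selected].mean() if selected else 0.0
            return relevance[f] - redundancy
        best = max(features, key=score)
        selected.append(best)
        features.remove(best)
    return selected

# Hypothetical toy usage:
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8],
                    "c": [4, 1, 3, 2], "y": [0, 1, 1, 0]})
print(relevance_redundancy_rank(toy, target="y", k=2))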
This repository is created and maintained by Andra Ionescu.