Using machine learning and the Arvados Lightning project, we were able to predict eye color to 95% accuracy.
This project is reliant on the following Python libraries: scikit-learn, pandas, matplotlib, numpy, scipy.
In addition, the tile searches cannot be run on non-UNIX machines, as they require the system grep and cat commands.
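The prerequisites above can be checked before running anything. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the libraries are importable under their usual module names (scikit-learn is imported as sklearn).

```python
import importlib.util
import shutil

# Import names of the libraries listed above (scikit-learn imports as "sklearn").
REQUIRED_MODULES = ["sklearn", "pandas", "matplotlib", "numpy", "scipy"]
# System commands that the tile search relies on.
REQUIRED_COMMANDS = ["grep", "cat"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_commands(names):
    """Return the command names that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python libraries:", missing_modules(REQUIRED_MODULES))
    print("Missing system commands:", missing_commands(REQUIRED_COMMANDS))
```

Two empty lists mean that the environment has everything the project lists as a requirement.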
First, clone the GitHub repository with git clone --recursive https://github.com/kevin-fang/lightning-eye-classifier. The --recursive flag is important, as the tile-searching script is in a submodule.
To download the NumPy arrays and assembly files needed for the project, set the Arvados API tokens and run ./download_dependencies.sh. This downloads the tiled data, names, information, and assembly files into the appropriate folders.
There are three ways to run the classifier: through Docker, through the Python scripts in src/, or through the Jupyter notebooks in notebooks/. A Dockerfile is provided in docker/, along with instructions for running through Docker.
To run through the Python scripts:
- Navigate to src/.
- Generate the classifier with python generateLeftClassifier.py
- Save the coefficients with python saveCoefs.py
- Search for each tile with python tileSearch.py
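The script steps above have to run in order, because each one consumes the output of the previous one. A minimal sketch of a wrapper that enforces this is shown below. The run_pipeline function is hypothetical and not part of the repository, and it assumes the scripts can be run with the same Python interpreter that runs the wrapper.

```python
import subprocess
import sys
from pathlib import Path

# Script names taken from the steps above, in the order they must run.
SCRIPTS = ["generateLeftClassifier.py", "saveCoefs.py", "tileSearch.py"]


def run_pipeline(src_dir="src", scripts=SCRIPTS):
    """Run each script inside src_dir, stopping at the first failure."""
    src = Path(src_dir)
    completed = []
    for name in scripts:
        # check=True raises CalledProcessError if a script exits non-zero,
        # so later steps never run against missing or stale output.
        subprocess.run([sys.executable, name], cwd=src, check=True)
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Assumes the current directory is the repository root.
    print("Completed:", run_pipeline("src"))
```

Running with cwd set to src/ mirrors the first step of the list, so relative paths inside the scripts resolve the same way as when they are run by hand.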
To run through the Jupyter notebooks:
- First, navigate to notebooks/
- Start the notebook session using jupyter notebook
- Open leftEyeClassifier.ipynb, saveCoefs.ipynb, and tileSearch.ipynb.
- Run leftEyeClassifier.ipynb first, and set whether to exclude or include hazel. This will generate the classifier and save it in svc.pkl.
- Then run saveCoefs.ipynb. This will open the svc.pkl classifier and serialize the learned coefficients in coefs.pkl.
- Finally, run tileSearch.ipynb. This will open coefs.pkl and search for each tile.
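The hand-off between the steps above (fit a classifier, pickle it to svc.pkl, then pickle its coefficients to coefs.pkl) can be illustrated with a short sketch. This is not the project's code. It assumes a scikit-learn LinearSVC, uses synthetic data in place of the tiled genomes, and every function name is hypothetical. Only the file names svc.pkl and coefs.pkl come from the steps above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.svm import LinearSVC


def train_and_save(X, y, svc_path):
    """Fit a linear SVC on (X, y) and pickle the fitted model."""
    clf = LinearSVC(dual=False, max_iter=10000)  # assumed model choice
    clf.fit(X, y)
    with open(svc_path, "wb") as f:
        pickle.dump(clf, f)
    return clf


def save_coefs(svc_path, coefs_path):
    """Load the pickled model and pickle its coefficient vector on its own."""
    with open(svc_path, "rb") as f:
        clf = pickle.load(f)
    coefs = clf.coef_.ravel()  # binary problem: coef_ has shape (1, n_features)
    with open(coefs_path, "wb") as f:
        pickle.dump(coefs, f)
    return coefs


def rank_columns(coefs_path):
    """Return column indices ordered from largest to smallest |coefficient|."""
    with open(coefs_path, "rb") as f:
        coefs = pickle.load(f)
    return np.argsort(-np.abs(coefs))


def demo(workdir):
    """Run the three steps on synthetic data (60 samples, 20 columns)."""
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(60, 20)).astype(float)
    y = np.array([0] * 30 + [1] * 30)
    X[:, 7] = 2 * y  # make column 7 separate the two classes
    svc_path = Path(workdir) / "svc.pkl"
    coefs_path = Path(workdir) / "coefs.pkl"
    train_and_save(X, y, svc_path)
    coefs = save_coefs(svc_path, coefs_path)
    return coefs, rank_columns(coefs_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        coefs, order = demo(d)
        print("number of coefficients:", coefs.shape[0])
        print("column with the largest |coefficient|:", order[0])
```

In a linear model, columns with large absolute coefficients contribute most to the decision, which is why a ranking of this kind is a natural input to a tile search.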
The classifier is able to predict the blue eye color to approximately 95% accuracy when the hazel color is excluded. Otherwise, it is able to reach 88% accuracy.
The classifier finds that eye color depends on base pairs 28,264,893 to 28,265,118, a range consistent with the location of the HERC2 gene, which is responsible for eye color (https://ghr.nlm.nih.gov/gene/HERC2#location, https://link.springer.com/article/10.1007%2Fs00439-007-0460-x).