Code for the CAFA PI challenge.
All code, unless otherwise specified, runs with Python 3.
To install Python dependencies with pip, run:
pip3 install pandas tensorflow scikit-learn
# Dask needs to be upgraded for tensorflow to work
pip3 install dask --upgrade
Download the data files:
bash data_acquisition/Scripts/download_test_data.sh
And parse them:
python3 parse/targets.py
The parsed data file is about 240 MB, so it's better to parse it on your own machine than to store it in source control.
To parse it, run the following from the project root:
python3 parse/cafa3.py
This data isn't actually useful for training a model, but it may be useful for testing whether a model works on very simple data. The script's random seed is fixed, so the data will be identical each time it is generated.
To generate the simulated data, run the following from the project root:
python3 -m ml.simulate_data
The data can then be found in {project root}/data/example/train_simulated.csv.
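To sanity-check the generated file, you can peek at it with pandas (a quick look only; the exact columns are whatever ml.simulate_data writes):

import pandas as pd

# Load the simulated training data and print its shape and first rows.
df = pd.read_csv("data/example/train_simulated.csv")
print(df.shape)
print(df.head())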
To load a CSV into a format usable by TensorFlow, run the following in Python:
from ml.embeddings import load_data
data, targets = load_data("./data/example/train_fake.csv")
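If you want to feed the result into TensorFlow's input pipeline, one option is tf.data (a minimal sketch, assuming load_data returns array-like data and targets of equal length; the batch size of 32 is arbitrary):

import tensorflow as tf

# Wrap the arrays in a Dataset, then shuffle and batch them for training.
dataset = tf.data.Dataset.from_tensor_slices((data, targets))
dataset = dataset.shuffle(buffer_size=len(targets)).batch(32)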
To view the training results in TensorBoard, run:
tensorboard --logdir log/
Then go to http://127.0.0.1:6006 in your browser to see the results.
You may need to delete the log/ directory before running the model a second time in order to clear old results.
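The training code is what writes those logs. If the model is built with Keras, one way to produce them is a TensorBoard callback pointed at log/ (a sketch only, not necessarily how ml.cnn does it; the one-layer model is a stand-in, and data and targets are the arrays from load_data above):

import tensorflow as tf

# A throwaway model, just to show where the callback plugs in.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# The callback writes summaries under log/, which tensorboard then reads.
model.fit(data, targets, epochs=5,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="log/")])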
To train the CNN model, run the following from the project root:
python3 -m ml.cnn
- Ashton, Calvin, Dan
So far we have collected all available protein data for Pseudomonas and yeast, as well as GO annotations for these proteins.
Summary statistics of these datasets are available in the README.txt file in the data_acquisition directory.
We also have a script that retrieves the JSON data for a given GO term (a sketch of this kind of lookup appears after this update).
We are currently working on a script to generate summary statistics of protein sequences, a script to map GO terms to relevant parent terms, and a way to find more proteins with our desired annotations to enrich our dataset.
- Dane, Dallas
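Our GO term retrieval script is in the repo; as a rough illustration of this kind of lookup, the public QuickGO REST service returns the JSON record for a term (a sketch only; the actual script may use a different source or endpoint):

import requests

# Fetch the JSON record for a single GO term from the QuickGO REST API.
term_id = "GO:0008150"  # example: biological_process
url = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/" + term_id
response = requests.get(url, headers={"Accept": "application/json"})
response.raise_for_status()
print(response.json())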
- Gage, Ben
- Kimball, Daniel
- Jonathan, Erica