University of Turku CAFA3 project
Files are located on the new machine at /home/sukaew/CAFA3.

The CNN experiment can be run with python train.py. You will need to copy the data folder from /home/kahaka/CAFA3/ first.
All preprocessing steps and sequence analyses can be run within the directory 'sequence_features' using the following command line:

python3 target_process.py -o [out_folder] -s [ori_seq]

The program needs two inputs, out_folder and ori_seq. The out_folder should be an absolute path ending with '/', where the input ori_seq file/directory and the output features directory reside. The input ori_seq can be in one of four formats: a folder of non-compressed fasta files, tar.gz, gz or zip. The sequence analyses include Blast Protein, DeltaBlast, Interproscan5, NetAcet, predGPI, nucPred and Taxonomy hierarchy. All analysis results are placed in a folder called 'feature'.
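target_process.py handles the four input formats itself; as an illustrative sketch only (not the program's actual implementation), accepting a folder, tar.gz, gz or zip input could look like this:

```python
import gzip
import os
import shutil
import tarfile
import zipfile

def unpack_sequences(ori_seq, out_folder):
    """Return a directory of plain fasta files for any of the four
    accepted ori_seq formats (folder, tar.gz, gz, zip)."""
    if os.path.isdir(ori_seq):          # already a folder of fasta files
        return ori_seq
    dest = os.path.join(out_folder, "unpacked")
    os.makedirs(dest, exist_ok=True)
    if ori_seq.endswith(".tar.gz"):
        with tarfile.open(ori_seq, "r:gz") as tar:
            tar.extractall(dest)
    elif ori_seq.endswith(".zip"):
        with zipfile.ZipFile(ori_seq) as z:
            z.extractall(dest)
    elif ori_seq.endswith(".gz"):       # a single gzipped fasta file
        target = os.path.join(dest, os.path.basename(ori_seq)[:-3])
        with gzip.open(ori_seq, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        raise ValueError("Unsupported ori_seq format: %s" % ori_seq)
    return dest
```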
All experiments can be run using the program run.py. The experimental code uses a three-step system. One or more of these actions can be performed using the command line option --action (or --a). By default, all three actions (build, classify and statistics) are performed.
The run.py program can be called like this:
python run.py -e [TASK] -o [OUTPUT] --targets external
The [TASK] value can be one of cafa3, cafa3hpo or cafapi. Depending on the task, different input files are used. The --targets option defines how CAFA targets are handled.
cd neural
Download and extract data (data.tar.gz) and model files (features_only.tar.gz) from https://github.com/TurkuNLP/CAFA3/releases/tag/v0.0
python3 predict_new.py ./features_only/ ./data/devel_sequences.fasta.gz ./data/examples.json.gz ./devel_predictions.tsv.gz
This will use the trained model from the ./features_only/ directory and make predictions for the target sequences. The input fasta file must not contain linebreaks within the sequences. examples.json.gz contains the pre-generated features. The last parameter is the output path.
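If your fasta file wraps sequences over multiple lines, they need to be joined before running the prediction. A minimal sketch (the file names are placeholders):

```python
import gzip

def unwrap_fasta(in_path, out_path):
    """Rewrite a gzipped fasta file so that each sequence occupies a
    single line, as required by predict_new.py."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        seq = []
        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):   # header: flush the previous sequence
                if seq:
                    fout.write("".join(seq) + "\n")
                    seq = []
                fout.write(line + "\n")
            elif line:
                seq.append(line)
        if seq:                        # flush the last sequence
            fout.write("".join(seq) + "\n")

# Example usage (placeholder file names):
# unwrap_fasta("targets.fasta.gz", "targets_unwrapped.fasta.gz")
```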
By default, the scikit-learn classification will use the train/devel/test split for the learning data. To use n-fold cross-validation instead, use the --fold option of run.py. To do 10-fold cross-validation, the program can be run 10 times using a script like this:
for FOLD in 0 1 2 3 4 5 6 7 8 9; do python run.py -o /tmp/CAFA10fold/fold$FOLD --fold $FOLD; done
The program ensemble.py can be used to combine predictions from different systems and the BLAST fallback baseline. To run the ensemble, use a command like:
python ensemble.py -a [PRED1_DIR] -b [PRED2_DIR] -o [OUTPUT] --baseline 4 --simple --terms 1000000 --write --cafa --clear
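The actual prediction format and combination logic are defined by ensemble.py; the basic idea of a simple ensemble can be sketched as averaging per (protein, GO term) scores from two systems, with pairs missing from one system scoring zero there (the data structures below are hypothetical, not ensemble.py's real format):

```python
def combine_predictions(preds_a, preds_b):
    """Average confidence scores from two systems, keyed by
    (protein id, GO term); a missing score counts as 0.0."""
    keys = set(preds_a) | set(preds_b)
    return {k: (preds_a.get(k, 0.0) + preds_b.get(k, 0.0)) / 2.0
            for k in keys}

# Hypothetical example scores from two systems:
a = {("T001", "GO:0003674"): 0.9, ("T001", "GO:0008150"): 0.4}
b = {("T001", "GO:0003674"): 0.7}
combined = combine_predictions(a, b)
```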