Requirements: having successfully run the steps in database creation.
Requirements2: check that the species are available in Ensembl or EnsemblMetazoa.
Goal: insert the species used in Bgee, the related taxonomy from NCBI, and sex information about species.
-
The file
source_files/species/bgeeSpecies.tsv
contains the species used in Bgee. This is the file to modify to add/remove a species. -
If you add/remove some species, you need to also update the files pipeline/db_creation/insert_data_sources_to_species.sql and
source_files/species/insert_species_sex_info.sql
. -
Details about
source_files/species/bgeeSpecies.tsv
file: The first line is a header line, it must contain the following columns:speciesId
: The NCBI taxonomy species ID (e.g., 9606 for human).genus
: genus of the species.species
: species name of the species.speciesCommonName
: species common namegenomeFilePath
: path to the genome file of the species on the Ensembl FTP. If no genome available for this species, this can point to the genome of a closely related species (e.g., use of chimpanzee genome for bonobo)genomeVersion
: genome version useddataSourceId
: ID of the data source providing the genome (currently, either Ensembl or EnsemblMetazoa)genomeSpeciesId
: the NCBI taxonomy ID of the species whose genome is being used. In most cases, it is the same asspeciesId
, but for some species with no genome available, it is possible to use the genome of a closely related species (e.g., use of chimpanzee genome for bonobo).fakeGeneIdPrefix
: If the genome of another species is used, the prefixes of the Ensembl gene IDs of this other species will be replaced with the provided prefix.keywords
: a list of keywords/alternative names associated to this species, separated by the character|
.
Note that it is from this file that the information about names of species are obtained, so, no errors allowed in it.
If a line starts with #
, it is commented and the species will not be inserted
-
This pipeline step requires the NCBI taxonomy, provided as an ontology.
-
We cannot use the official taxonomy ontology because, as of Bgee 13, it does not include the last modifications that we requested to NCBI, and that were accepted (e.g., addition of a Dipnotetrapodomorpha term). Also, to correctly infer taxon constraints at later steps, we need this ontology to include disjoint classes axioms between sibling taxa, as explained in a Chris Mungall blog post. The default ontology does not include those.
-
This pipeline step is thus capable of generating its own version of the NCBI taxonomy ontology, in the exact same way as for the official ontology, as described on the OBOFoundry wiki (see notably the Makefile generating the ontology). It is based on files available from the NCBI FTP (https://ftp.ebi.ac.uk/pub/databases/taxonomy/taxonomy.dat). The code to generate disjoint classes axioms is based on the code from the owltools Java class
owltools.cli.TaxonCommandRunner
, in the moduleOWLTools-Runner
. -
This custom taxonomy will include the species used in Bgee and their ancestors, the taxa used in our annotations and their ancestors, the taxa used in Uberon an their ancestors. To extract taxa used in Uberon, we use the
ext
version (this is the one containing more taxa). -
Note that the generation of the taxonomy requires about 15Go of memory.
-
-
This pipeline steps will insert sex information about species, see
source_files/species/insert_species_sex_info.sql
.
-
If it is the first time you execute this step in this pipeline run:
make clean
-
Run Makefile:
make
-
Modify the file pipeline/db_creation/update_data_sources.sql: you need to add the last modification date of the taxonomy used. This information can be found by looking at the file
taxonomy.dat
at https://ftp.ebi.ac.uk/pub/databases/taxonomy/.
-
generated_files/species/step_verification_RELEASE.txt
should contain: the total number of species inserted, the total number of taxa inserted, the number of taxa inserted that represent the least common ancestor of at least two species used in Bgee; the complete list of species ordered by their ID; the complete list of taxa least common ancestor, ordered by their position in the taxonomy (root to leaf); the complete list of taxa ordered by their position in the taxonomy (root to leaf).- Compare the species list between releases.
- Check that there is no species with missing sex information. Otherwise, you need to update the file
source_files/species/insert_species_sex_info.sql
. - The taxa should be displayed ordered from root to leaf, and taxa of a same level should be ordered by alphabetical order of their scientific name. Verify it is correct, it is important.
-
The following files should have been generated:
generated_files/species/bgee_ncbitaxon.owl
, our custom taxonomy ontologygenerated_files/species/annotTaxIds.tsv
, a TSV file containing the IDs of the taxa used in our annotationsgenerated_files/species/allTaxIds.tsv
, a TSV file containing the IDs of the species used in Bgee, of the taxa used in our annotations, of the taxa used in Uberon.
-
You can have an exception thrown, saying that a specified taxon does not exist in the taxonomy ontology, for instance
java.lang.IllegalArgumentException: Taxon NCBITaxon:71164 was not found in the ontology
. It likely means that an incorrect/deprecated taxon is used in Uberon. Remove the ID of the taxon in the filegenerated_files/species/allTaxIds.tsv
(so if the exception is related to a taxonNCBITaxon_71164
, remove the ID71164
). Check if the taxon ID is present in the filegenerated_files/species/annotTaxIds.tsv
, if it is, the error is on us. Otherwise, report the problem on the Uberon tracker, if you identified the taxon in Uberon. You will most likely need to manually modify the ontology to remove the offending taxa in the mean time. -
If you need to re-insert the data in the database, use
make deleteSpeciesAndTaxa
, thenmake clean
.
- generate the TSV files containing the taxa used in out annotations:
make ../../generated_files/species/annotTaxIds.tsv
- generate the TSV files containing all taxa used in Bgee, in our annotations, in Uberon
make ../../generated_files/species/allTaxIds.tsv
- generate the taxonomy ontology:
make ../../generated_files/species/bgee_ncbitaxon.owl
- Remove the species and taxa from the database (this is not done when calling
clean
, to avoid wiping the database by accident)make deleteSpeciesAndTaxa