Section 2 EnsemblLite
In comparative genomics and phylogenetics studies, samples of related biological sequences are the crucial input to our science. Given the scale of databases and genomes, performing analyses such as orthology classification and whole-genome alignment ourselves is often infeasible. The Ensembl project has performed such analyses for many genomes and offers extensive additional information that can be used in our experimental designs.
As seen in the earlier part of the workshop, Ensembl provides different mechanisms for interacting with its data. Which of these mechanisms is most practical depends on your questions, resources, and proximity to Ensembl servers. Here are some illustrative questions to help you think about this:
- How long does it take to get the CDS sequences from 1000 one-to-one orthologs from all fully sequenced primate genomes?
- How long does it take to sample 1000 genes from mouse, oriented so that their TSS are centered?
- How much of your time does it take to write the code to perform those tasks?
- How long does it take that code to run?
From Australia, the network latency is such that addressing the above will take many hours (more likely days) via web queries against northern hemisphere servers. The same problem applies to direct queries of the Ensembl MySQL databases (note that support for MySQL data delivery will likely be discontinued). Even if you are geographically closer, this may be your experience when the Ensembl server load is high.
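As a rough back-of-envelope sketch of how sequential remote queries add up, the snippet below multiplies the number of requests by a per-request round-trip time. Every number in it is an illustrative assumption rather than a measurement: the count of primate genomes, the number of requests needed per sequence, and the seconds per request are all placeholders you should replace with values from your own setting.

```python
def estimated_hours(n_genes, n_species, requests_per_sequence, seconds_per_request):
    """Wall-clock time in hours if every request is made one after another."""
    total_requests = n_genes * n_species * requests_per_sequence
    return total_requests * seconds_per_request / 3600


# Assumed values for illustration only (not measured):
# 1000 orthologs, 20 primate genomes, 2 requests per sequence
# (e.g. one lookup plus one sequence fetch), 1.5 s per round trip.
hours = estimated_hours(
    n_genes=1000,
    n_species=20,
    requests_per_sequence=2,
    seconds_per_request=1.5,
)
print(f"{hours:.1f} hours")
```

Under these assumed values the estimate is on the order of many hours, and it scales linearly with each factor, so higher latency or a heavily loaded server pushes it towards days.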
One solution is to localise Ensembl data. This is what EnsemblLite does. It is a command-line tool developed jointly by the Cogent3 and Ensembl Infrastructure teams. The project objectives are:
- Minimal IT resources to host a localised subset of Ensembl data (should be able to run efficiently on a laptop).
- A CLI that is clear and usable.
- Fast querying, delivering standardised flat file formats for integration with standard tools.
- A software API that facilitates the development of community extensions.
Warning: EnsemblLite is in the early stages of development, so the feature set is currently limited. You can help us prioritise by telling us what features matter to you! You can also contribute code!
- An `elt` command line tool for accessing Ensembl data
- `elt` subcommands perform individual tasks
- Internet access is only required for downloading user-specified data from the Ensembl FTP server
- Run commands in parallel when feasible
- A config file specifies which Ensembl domain to get data from and the desired data