Data manager for the BLAST database *.loc files? #22
Related to this, writing tests with automatically installed ...
Looks like an attractive way to go! Thanks for introducing me to this.
Daniel Blankenberg has done some related work on this here, CC @jj-umn
Thanks for the reference! Something like this data manager approach is a needed plank in what we want to do. Not sure if I mentioned this, but we're seeing a plethora of specific curated gene databases (having "primary target" sequences), e.g. http://www.cpndb.ca 's CPN60 Chaperonin database, or hpa.org.uk's Legionella mip database, which are very useful for distinguishing bacterial clades but which currently have no data connection to Galaxy. So we'd like to develop a Galaxy tool that acts as a (more or less generic) gateway to each of these reference databases: a way to create, describe (using ontology), and periodically synchronize with 3rd party sources the reference databases needed in an institution's Galaxy install. (Of course one challenge is that some of these databases have only a web query form online, rather than direct web URLs to their files, but that's another battle.) Damion
I'd be interested to see the NCBI BLAST wrappers use the data tables so that they can be used with data managers. I've been planning on getting a data manager working for the NCBI databases, but I realized that it's not going to work if the BLAST wrappers are still using ...
See also the recent paper on the Galaxy Data Managers, Blankenberg et al. (2014) "Wrangling Galaxy's reference data" http://dx.doi.org/10.1093/bioinformatics/btu119 and the associated wiki pages https://wiki.galaxyproject.org/Admin/Tools/DataManagers Paging @jj-umn - are you still planning to look at this (data managers for BLAST+), or should we try to get Daniel Blankenberg more directly involved?
On 4/1/14, 1:01 PM, Peter Cock wrote:
It is still on my long list of TODOs, but I just thought it was something that should be done. I'm fine with whomever can get to it first. James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
I could try and lend a hand with this if everyone is busy with other tasks.
On 4/2/14, 8:52 AM, mike8115 wrote:
Sounds good. I'm presuming you've found: https://wiki.galaxyproject.org/Admin/Tools/DataManagers JJ James E. Johnson, Minnesota Supercomputing Institute, University of Minnesota
Sorry for the wait, I did use the wiki pages on Data Managers. I've got it working in my local instance of Galaxy and my group's development instance, and pushed the changes to my fork of galaxy_blast. Is there anything I should look at or do before making a pull request here?
Note git doesn't allow you to check in an empty directory, which is why you can't see the proposed data_managers folder yet.
I have a tentative plan for where to put the data managers in the repository - d24a5e6 - it seems clearer to me not to put them under the existing folders. @mike8115 - I'll comment on your fork's commit about this and other folder placement issues... it might be best to use a named branch rather than working on your master branch. Also, given you've started from Daniel's http://testtoolshed.g2.bx.psu.edu/view/blankenberg/data_manager_example_blastdb_ncbi_update_blastdb I'd like to explicitly get his permission to include his work here with proper attribution. Update: email sent http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-April/018987.html
Hi all, this sounds great. Let me know what I can do to help. Thanks.
Thanks Dan :) So, first of all, have you any objection to your NCBI update example being added to this repository under the MIT licence? Assuming that's fine, would you prefer to do this yourself as a pull request (subject to debating folder names etc), or let me prepare a commit with @blankenberg as the author (which I can do on a feature branch for your approval, before applying to the master branch)? e.g. I don't follow why you have some files under ... Once that's done, we can rebase/reapply @mike8115's work, which will clearly show his changes. We'll also need to discuss where the BLAST data manager(s) will live on the Tool Shed, which could be under the IUC account. That could mean deprecating the NCBI BLAST data manager example in favour of a new location?
I have no objection to an MIT license. Whatever is easier/more convenient for you. As far as paths are concerned, the ... Having the new, more useful, data manager under the IUC would be great. I have no problem deprecating the existing NCBI BLAST data manager when it is ready.
If either of you are busy with other tasks, I could make the changes and submit the pull request. Since my group is interested in having this feature available in Galaxy, I could dedicate the time to finish it. But if you have the time, I suppose you would be able to make the necessary changes to bring the tool up to IUC standards a lot faster than I would. Splitting the data manager would create a reasonably different script in this case. Presently, I use BLAST's update_blastdb.pl script to retrieve nucleotide and protein databases. Protein domain databases are retrieved using the ftplib module, since they are not available via update_blastdb.pl. I feel that most data managers would be inherently similar, given that the input and output mainly differ by the name and source location, but having one data manager tool controlling too many databases could get messy visually.
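The ftplib part of that approach might look roughly like this. This is only a sketch: the NCBI FTP host and the pre-formatted CDD directory are real, but the helper functions and the filtering convention are illustrative, not the actual data manager script.

```python
from ftplib import FTP

NCBI_HOST = "ftp.ncbi.nlm.nih.gov"
# Pre-formatted protein domain databases (e.g. Cdd_LE.tar.gz) live here.
CDD_PATH = "/pub/mmdb/cdd/little_endian"

def domain_db_archives(filenames):
    """Keep only the .tar.gz archives from an FTP directory listing."""
    return sorted(name for name in filenames if name.endswith(".tar.gz"))

def list_domain_dbs():
    """Connect anonymously and list the available domain database archives."""
    ftp = FTP(NCBI_HOST)
    ftp.login()  # anonymous login
    try:
        names = [name.rsplit("/", 1)[-1] for name in ftp.nlst(CDD_PATH)]
    finally:
        ftp.quit()
    return domain_db_archives(names)
```

The actual download step would then call `ftp.retrbinary()` per archive and unpack it into the data manager's target directory.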
I'll just throw this out there if any of you are interested... it's a tangent on the subject of recreating search results, namely, being able to specify a date or version of a database to recreate the search for. There's a scenario where we'd bring up a slightly customized version of the BLAST search that has an extra input for entering a date, and behind the scenes we determine which version of a database we'd need, and recall that version to search it. So the first question I have is: have others been getting requests to provide this kind of a solution?

I tested out git to see if it could quickly bring back versions of a largish nucleotide FASTA file. Git's diff algorithm completely flops unless you format each FASTA entry as a single line (tab delimited) merged from a multi-line entry. But it does handle, say, 20 versions of a 640 MB file pretty well, being able to recreate a version within 8 seconds or so. I've tested it up to about 2 GB, where it takes about 30 seconds to retrieve a version. Seems like the formula is roughly 50 MB/second. Another git test however flopped on a protein database formatted in the same fashion. What happens is git decides to delete every row of the old version file and then insert every line of the new version - very fussy about the composition of lines in its diff algorithm. In the end I scripted a Python diff that does the same thing in a low-tech fashion, and it handles any size of FASTA file, creating a database of versions that's about 1.1x bigger than the latest version, at about the same rate as git. So I'm thinking of that now as the mainstay for the scheme.

In terms of BLAST/FASTA database management, I've been considering another approach that sits outside of Galaxy, namely the http://biomaj.genouest.org/ BioMAJ software, which focuses on regular scheduled downloading. Are any of you familiar with it? Our comrades at the National Microbiology Lab in Winnipeg have been using it and have been satisfied.
Apparently it can trigger hooks both before and after file download to customize synchronization/generation of the file-based databases from FASTA or other file downloads. I can imagine the same thing done under the Galaxy hood too - a scheduled download + processing hooks? Currently we're also using update_blastdb.pl, which is OK, but we agree it lacks a GUI to enable easy management and monitoring of data sources. Feedback appreciated, d.
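The one-record-per-line trick Damion describes (so that line-based diff tools see one FASTA record as one line) can be sketched in a few lines of Python. The tab delimiter and field order here are assumptions matching his description, not his actual script:

```python
def flatten_fasta(lines):
    """Collapse each multi-line FASTA record to 'header<TAB>sequence' so that
    line-based diff tools (git, difflib) treat one record as one line."""
    records = []
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append(header + "\t" + "".join(seq))
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        records.append(header + "\t" + "".join(seq))
    return records
```

With records flattened this way, inserting or removing one sequence changes exactly one line, which is the shape of change that git's diff machinery handles efficiently.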
Perhaps ideally the NCBI BLAST database manager could support either approach - or is that too complicated?
I've started integrating Dan's data manager into this repository on the branch https://github.com/peterjc/galaxy_blast/tree/data_manager - the initial commit 21d7cff just checked in the files from Dan's Test Tool Shed version (adjusting the folder structure); the later commits are just minor tidying. @blankenberg - does this look OK to you? If so, I will apply that to the master branch. If not, we can tweak things. @mike8115 - once that is done, we'll look at rebasing your work on top of this.
Currently, the data manager tool uses the recommended method, which @peterjc mentioned, of retaining multiple copies of a BLAST database. Something like @ddooley's diff script is worthwhile to look at: it would reduce the disk space requirement and maintain reproducibility, but I wouldn't know how to implement that. To accommodate that approach, we would have to rewrite the script to handle the FASTA files rather than the pre-formatted databases. Beyond that, I'm not sure what to do about it. In regards to having a scheduled download: from the tool's perspective, it only runs on demand. Having it run routinely is beyond the data manager tool's control, but I presume it would be possible to do that externally via the API? @peterjc - Sounds great. Let me know if anything needs to be done on my end.
@blankenberg - thanks, I've pushed that to the master branch now. I've not tested it yet either - hopefully adding it to the TravisCI setup will be straightforward. I'll also want to add installation instructions, and the tar command for preparing a ToolShed upload.
@mike8115 I've tried to rebase/merge your work on the branch https://github.com/peterjc/galaxy_blast/tree/data_manager - see 42cb875. Can you have a look at this, particularly the unit tests - was there a reason for dropping Dan's original test?
Looking at @mike8115's work, he uses the Data Table approach - https://wiki.galaxyproject.org/Admin/Tools/Data%20Tables - to switch the BLAST+ wrappers' use of the longer form:
To the shorter:
The column information is instead defined via a separate XML file.
However, this new XML file is a potential dependency headache - which Tool Shed repository would it belong to? Probably a new (common) dependency like the BLAST datatypes. My instinct right now is not to use the Data Table approach at all - aside from the dependency problem, this would actually increase the number of lines of XML in our code base, since right now our ... I've started a thread on the galaxy-dev list to discuss this issue: http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-April/019023.html / http://dev.list.galaxyproject.org/Data-Tables-and-loc-files-Using-named-columns-versus-from-data-table-tc4664149.html
@peterjc I dropped the original test because the script generated unique IDs by computing a hash value over all the directories and folders. I changed that to use the date instead, since the databases update on a daily basis anyway. I suppose I could have modified his test to use regular expressions instead of a strict match. If we decide to move away from data tables, I'm not too sure how else to get tools to use new databases downloaded by the data manager.
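A date-keyed entry of the kind described here might be built like this. The three-column layout (value, name, path) matches the convention of Galaxy's blastdb.loc files; the helper function and the path layout are hypothetical, not the actual data manager code:

```python
import datetime

def make_entry(db_name, base_path, date=None):
    """Build a blastdb-style data table entry keyed by download date
    rather than by a content hash."""
    date = date or datetime.date.today()
    stamp = date.strftime("%Y_%m_%d")
    unique_id = "%s_%s" % (db_name, stamp)
    return {
        "value": unique_id,                    # unique key in the *.loc file
        "name": "%s (%s)" % (db_name, stamp),  # label shown in the tool form
        "path": "%s/%s/%s" % (base_path, stamp, db_name),
    }
```

Keying on the date makes the ID human-readable and stable for a strict-match test, at the cost of assuming at most one download per database per day.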
Updating Dan's test to use regular expressions would be nice, but on the other hand the est database is a big one to download just for running a test! Ah. Assuming we must use the Data Table interface for the Data Manager, won't that ultimately update the ...
I'll let Dan correct me if I'm wrong, but I think the entries from the various blastdb_*.loc files would be merged in memory but not persisted to the single blastdb_*.loc file.
@jj-umn is correct. Data managers use namespacing on the .loc files that are installed from a Tool Shed, i.e. the entries created by the data manager tool will not end up in tool-data/blastdb.loc.
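Conceptually, the in-memory merge Dan describes could be pictured like this - a sketch of the idea only, not Galaxy's actual implementation:

```python
def load_loc(lines):
    """Parse one tab-separated *.loc file into (value, name, path) rows,
    skipping comments and blank lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        rows.append(tuple(line.split("\t")))
    return rows

def merged_table(*loc_files):
    """Union the entries of several *.loc files in memory.  The files on
    disk are never rewritten - that is the namespacing behaviour."""
    table, seen = [], set()
    for rows in loc_files:
        for row in rows:
            if row[0] not in seen:  # first entry wins on a duplicate value
                seen.add(row[0])
                table.append(row)
    return table
```

Tools querying the data table see the union, while the admin-maintained blastdb.loc and the data-manager-owned loc file each keep only their own entries.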
More detailed reply from @blankenberg on the mailing list:
See http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-April/019027.html - and the Trello issue https://trello.com/c/VZxV08Qt which says:
It looks like we could have identical copies of the ...
I would say that ... @peterjc Aside from removing Dan's initial test with the EST database, I don't see other concerns. https://github.com/mike8115/galaxy_blast/commit/98326fa75516f7d8fe8a270de5d76fb02aa182e6 contains my last edits based on your current repository.
@mike8115 - why did you change the tool ID from ...? Another thought: does it make sense to use the last modified date of the *.tar.gz files on the NCBI FTP site, rather than the download date? Also, for now at least, I will not change ...
I didn't notice the ID difference. Changing it to ... Changing the script to do that should be quick. https://github.com/mike8115/galaxy_blast/commit/0785fea8c18eec3ae41ae453fc02354b5a608005 holds all of those changes. With the data tables, the *.loc files from the NCBI BLAST wrappers won't be modified as far as the data manager is concerned. New databases from the data manager are added to the *.loc file in the data manager's files. The data table is simply a list of known *.loc files that a tool can access to receive entries it could otherwise not get. Just keep in mind that until the wrappers are set to look at other *.loc files, they won't benefit from the data manager unless users manually copy the entries from the data manager's *.loc file into the wrappers' *.loc file.
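If the last-modified date is taken from the NCBI FTP server rather than the download time, the standard FTP MDTM command gives it directly. `ftplib.FTP.sendcmd` and the "213 YYYYMMDDHHMMSS" reply format are standard; the helper functions here are illustrative, not the actual script:

```python
import datetime
from ftplib import FTP

def mdtm_date(reply):
    """Parse an FTP MDTM reply like '213 20140415103000' into a date."""
    code, _, stamp = reply.partition(" ")
    if code != "213":
        raise ValueError("unexpected MDTM reply: %r" % reply)
    return datetime.datetime.strptime(stamp[:8], "%Y%m%d").date()

def remote_modified_date(host, path):
    """Ask an FTP server for the last-modified date of one remote file."""
    ftp = FTP(host)
    ftp.login()  # anonymous login
    try:
        return mdtm_date(ftp.sendcmd("MDTM " + path))
    finally:
        ftp.quit()
```

Using the server-side date means re-running the data manager on an unchanged *.tar.gz reproduces the same entry ID, rather than minting a new one per download.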
@mike8115 I've asked Dan about the pre-existing ID inconsistency, see comments on 21d7cff. And regarding the ...
@mike8115 I'm having trouble getting your tests to pass, both locally and via TravisCI. Work in progress here: https://github.com/peterjc/galaxy_blast/commits/data_manager2 For the protein database tests, the patent amino acids database isn't too large, so it might be the best choice. However, for the nucleotide databases, the ...
Hi Peter, thanks for looking into this. Sorry about the mismatch in the tool XML file - it was a sloppy error on my part. I agree with your choices for the databases; I really didn't consider the size of the tests when I added them. Do you want me to rewrite the tests?
@mike8115 Great - can you work from that ... branch?
@peterjc I've copied your branch into my repo and made the changes there: https://github.com/mike8115/galaxy_blast/commit/17d760a3e5e07dac4c00758cbca706900199d431. From my machine, the tests completed in 80s. The README looks great, but I've removed some repeated lines in the manual installation section. Otherwise no changes are needed. Also, there are still some outstanding changes mentioned in #22 (comment) that have yet to be applied in your branch.
Hi,
I've filed #52 for the specific sub-issue of using the new Data Tables approach to defining the columns in the ...
Ah, I'd like to hear about this when you are finished, Anthony. I'm just trying out BioMAJ now (as well as finishing a versioned database recall system for Galaxy, but it doesn't use data managers). Damion
@ddooley Yep, no problem! The code will be available on GitHub. I'm currently testing it and fixing some bugs; it should be ready soon. I will tell you when I am done.
In discussion with @blankenberg and @bimbam23 in Portland at GCCBOSC 2018, this example would be useful to look at - it can fetch from pre-defined sources, a URL, or a history entry. It would be nice to be able to pull in a BLAST database from your history, or pull in a FASTA file from your history or a URL and build the database with ...
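Building a database from a history FASTA file or a URL would end with a makeblastdb invocation. The flags below (-in, -dbtype, -out, -title, -parse_seqids) are real makeblastdb options; the wrapper functions themselves are a hypothetical sketch, not the proposed data manager:

```python
import subprocess

def makeblastdb_cmd(fasta_path, dbtype, out_path, title=None):
    """Assemble a makeblastdb command line; dbtype is 'nucl' or 'prot'."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype,
           "-out", out_path, "-parse_seqids"]
    if title:
        cmd += ["-title", title]
    return cmd

def build_db(fasta_path, dbtype, out_path, title=None):
    """Run makeblastdb (must be on PATH), raising on a non-zero exit code."""
    subprocess.check_call(makeblastdb_cmd(fasta_path, dbtype, out_path, title))
```

Keeping the command construction separate from the subprocess call makes the interesting logic testable without BLAST+ installed.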
Has anybody started implementing this yet?
Can we use the new Galaxy Data Manager framework to make it easier to manage the BLAST databases configured via *.loc files?
https://wiki.galaxyproject.org/Admin/Tools/DataManagers/