-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make a more generalized 'upstream' connector in composition objects/tables? #234
Comments
IMO this should be pushed down to the DB and create a function to achieve such thing. That will localize all the code in a single place. An example on how this is done is in the function "qiita.get_plate_studies" in the db_patch_manual.sql file. That approach is way faster than the current implemented approach, it doesn't create duplicated information, and there is no need to modify the DB. |
Agree that it would be good to have this in the db. I'm worried about maintainability having all this plate type-specific logic though -- that function is at the end of a 2.5 MB file! And if we change any of our processes, we'll have to update the logic accordingly. How hard would it be to modify these tables in a way that could allow this type of access to be generalized? I know almost no SQL to be sure, but was imagining something like a 'column name alias'. |
that is possible but then you have to make sure to maintain consistency on updates, since it is data duplication (which such code is basically doing something similar that the function that mention above is doing to ensure consistency -> hence needs to be updated on changes). That file is horrible, and it should probably be break down by functionality to aid development. Note, however, that once the system is in production, any updated that you need to do to such function need to go through a patch to the database, not modifying that file directly. That file is initializing the DB to be able to start working at day 0. |
@tanaes asked "Is it possible, for example, to add a column name alias to the composition type tables so that one could SELECT upstream_composition_id FROM qiita.gdna_composition instead of SELECT sample_composition_id FROM qiita.gdna_composition?" Let's be careful here: although it is the case that MOST of the (current) compositions have only one upstream composition, some of them have more than one--and there is nothing about a composition which limits it to ONE "upstream" composition id. For example, library_prep_16s_composition--one of the composition types you reference in the code snippet above--records two "upstream" composition ids: normalized_gdna and primer. That said: in the current structure, any composition table that "inherits" from gdna_composition leads back to a single sample_composition. Perhaps there could be a "sample_composition" property on all *gdna_composition objects that handles the traceback in a way specific to that particular composition type, but hides it from the user? |
If it's a method of the object though, doesn't that require us to go through Python (and hence a separate SQL call) for each composition in turn? |
I don't think I understand your question. Those python methods just call an SQL method, rather than backtracking the entire chain. An alternative to reduce the code duplication of such method (which it is 3 or 4 LOC) would be to create a subclass of Composition that just adds that method and make all the classes that have a "sample_composition" inherit from it. I'm not sure if that actually saves complexity or adds... |
It's not the code duplication I'm concerned about, it's that you'd have to call the python function to call the SQL e.g. 384 times for one plate to get the 'upstream' plates, which would be very slow. Those sorts of calls have kept biting us in terms of performance. |
Oh I see - note that such behavior is a result of the ability to "cherry pick" samples from plates. An alternative is to create bulk operations and/or move some of the functionality from python to SQL. I think in that case it just depends on the nature of the operation being made. For example, for issue #77, ideally you can perform that entire operation in SQL and just return the final result to python, rather than encoding the functionality in python. In issue #195 a similar argument can be made - it may be useful to create a bulk operation on the plate level to return some kind of useful structure containing all the sample ids and their location on such plate. Note that those type of functions create jumps on the OOP structure, but those are valid arguments in favor of performance (which is common). An alternative is also exploit parallelization. A fair amount of operations are read-only, and hence can be performed at the same time. |
Yes, and that's what we've been trying to do. But really this is way beyond
my SQL-foo. When @mortonjt and I tried we were stymied by the fact that
different composition types had different column names for specifying the
the input compositions, which made it impossible to make a simple and
generalized 'what are the plates upstream of this plate' query. Any
suggestions?
…On Thu, May 24, 2018 at 7:56 AM Jose Navas ***@***.***> wrote:
Oh I see - note that such behavior is a result of the ability to "cherry
pick" samples from plates. An alternative is to create bulk operations
and/or move some of the functionality from python to SQL. I think in that
case it just depends on the nature of the operation being made.
For example, for issue #77 <#77>,
ideally you can perform that entire operation in SQL and just return the
final result to python, rather than encoding the functionality in python.
In issue #195 <#195> a similar
argument can be made - it may be useful to create a bulk operation on the
plate level to return some kind of useful structure containing all the
sample ids and their location on such plate. Note that those type of
functions create jumps on the OOP structure, but those are valid arguments
in favor of performance (which is common).
An alternative is also exploit parallelization. A fair amount of
operations are read-only, and hence can be performed at the same time.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#234 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AH6JALi5JguXPKLhd0a22PSY8-W7LfEFks5t1soxgaJpZM4T4pbQ>
.
|
I may be able to put some suggestion together by the beginning of next week. Not, however, that I will not be able to test the code locally. |
Migrating to nice-to-have, as this issue isn't impeding current functionality. |
Yesterday, @mortonjt and I were working together on issue #77 and running into a challenge that @ElDeveloper and I also encountered in issue #195: different composition types have differently named tables or methods names necessary to access the constituent samples.
In issue #195, this led to the following code:
https://github.com/jdereus/labman/blob/0dd5afef6a980ff25d6b420618828a2bddae383c/labman/gui/handlers/process_handlers/quantification_process.py#L57-L67
In issue #77, we're trying to follow compositions 'upstream', but this requires us to have a lookup table that references the current composition type so that we can know the appropriate column name to query for the upstream composition:
https://github.com/mortonjt/labman/blob/33277e138dde41a5789759df20e3c08c01ddfad2/labman/db/plate.py#L580-L607
Would it be possible to generalize this somehow? @AmandaBirmingham, do you have any thoughts on this? Is it possible, for example, to add a column name alias to the composition type tables so that one could
SELECT upstream_composition_id FROM qiita.gdna_composition
instead ofSELECT sample_composition_id FROM qiita.gdna_composition
?The text was updated successfully, but these errors were encountered: