Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make a more generalized 'upstream' connector in composition objects/tables? #234

Open
tanaes opened this issue May 9, 2018 · 11 comments
Open

Comments

@tanaes
Copy link
Collaborator

tanaes commented May 9, 2018

Yesterday, @mortonjt and I were working together on issue #77 and running into a challenge that @ElDeveloper and I also encountered in issue #195: different composition types have differently named tables or methods names necessary to access the constituent samples.

In issue #195, this led to the following code:

https://github.com/jdereus/labman/blob/0dd5afef6a980ff25d6b420618828a2bddae383c/labman/gui/handlers/process_handlers/quantification_process.py#L57-L67

In issue #77, we're trying to follow compositions 'upstream', but this requires us to have a lookup table that references the current composition type so that we can know the appropriate column name to query for the upstream composition:

screenshot

https://github.com/mortonjt/labman/blob/33277e138dde41a5789759df20e3c08c01ddfad2/labman/db/plate.py#L580-L607

Would it be possible to generalize this somehow? @AmandaBirmingham, do you have any thoughts on this? Is it possible, for example, to add a column name alias to the composition type tables so that one could SELECT upstream_composition_id FROM qiita.gdna_composition instead of SELECT sample_composition_id FROM qiita.gdna_composition?

@josenavas
Copy link
Member

IMO this should be pushed down to the DB and create a function to achieve such thing. That will localize all the code in a single place. An example on how this is done is in the function "qiita.get_plate_studies" in the db_patch_manual.sql file. That approach is way faster than the current implemented approach, it doesn't create duplicated information, and there is no need to modify the DB.

@tanaes
Copy link
Collaborator Author

tanaes commented May 9, 2018

Agree that it would be good to have this in the db. I'm worried about maintainability having all this plate type-specific logic though -- that function is at the end of a 2.5 MB file! And if we change any of our processes, we'll have to update the logic accordingly.

How hard would it be to modify these tables in a way that could allow this type of access to be generalized? I know almost no SQL to be sure, but was imagining something like a 'column name alias'.

@josenavas
Copy link
Member

that is possible but then you have to make sure to maintain consistency on updates, since it is data duplication (which such code is basically doing something similar that the function that mention above is doing to ensure consistency -> hence needs to be updated on changes).

That file is horrible, and it should probably be break down by functionality to aid development. Note, however, that once the system is in production, any updated that you need to do to such function need to go through a patch to the database, not modifying that file directly. That file is initializing the DB to be able to start working at day 0.

@AmandaBirmingham
Copy link
Collaborator

@tanaes asked "Is it possible, for example, to add a column name alias to the composition type tables so that one could SELECT upstream_composition_id FROM qiita.gdna_composition instead of SELECT sample_composition_id FROM qiita.gdna_composition?"

Let's be careful here: although it is the case that MOST of the (current) compositions have only one upstream composition, some of them have more than one--and there is nothing about a composition which limits it to ONE "upstream" composition id. For example, library_prep_16s_composition--one of the composition types you reference in the code snippet above--records two "upstream" composition ids: normalized_gdna and primer.

That said: in the current structure, any composition table that "inherits" from gdna_composition leads back to a single sample_composition. Perhaps there could be a "sample_composition" property on all *gdna_composition objects that handles the traceback in a way specific to that particular composition type, but hides it from the user?

@tanaes
Copy link
Collaborator Author

tanaes commented May 23, 2018

If it's a method of the object though, doesn't that require us to go through Python (and hence a separate SQL call) for each composition in turn?

@josenavas
Copy link
Member

I don't think I understand your question. Those python methods just call an SQL method, rather than backtracking the entire chain. An alternative to reduce the code duplication of such method (which it is 3 or 4 LOC) would be to create a subclass of Composition that just adds that method and make all the classes that have a "sample_composition" inherit from it. I'm not sure if that actually saves complexity or adds...

@tanaes
Copy link
Collaborator Author

tanaes commented May 24, 2018

It's not the code duplication I'm concerned about, it's that you'd have to call the python function to call the SQL e.g. 384 times for one plate to get the 'upstream' plates, which would be very slow. Those sorts of calls have kept biting us in terms of performance.

@josenavas
Copy link
Member

Oh I see - note that such behavior is a result of the ability to "cherry pick" samples from plates. An alternative is to create bulk operations and/or move some of the functionality from python to SQL. I think in that case it just depends on the nature of the operation being made.

For example, for issue #77, ideally you can perform that entire operation in SQL and just return the final result to python, rather than encoding the functionality in python. In issue #195 a similar argument can be made - it may be useful to create a bulk operation on the plate level to return some kind of useful structure containing all the sample ids and their location on such plate. Note that those type of functions create jumps on the OOP structure, but those are valid arguments in favor of performance (which is common).

An alternative is also exploit parallelization. A fair amount of operations are read-only, and hence can be performed at the same time.

@tanaes
Copy link
Collaborator Author

tanaes commented May 24, 2018 via email

@josenavas
Copy link
Member

I may be able to put some suggestion together by the beginning of next week. Not, however, that I will not be able to test the code locally.

@charles-cowart
Copy link
Collaborator

Migrating to nice-to-have, as this issue isn't impeding current functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants