update sample sheet generation to handle project shortname #204

tanaes · 2018-02-20T20:52:25Z

We're using the sample_proj field in the Sample Sheets as a unique reference for demultiplexing to assign demuxed fastq files automatically to their project folder.

We need to update the sample sheet generation to read in a per-sample value:
https://github.com/jdereus/labman/blob/0b36cfb63e8222f078f2479c9651e225864fb9c1/labman/db/process.py#L2577

And we may also need to add a column to the projects table to allow a project shortname in a format like 'PI_qiitaID' or similar.

AmandaBirmingham · 2018-03-13T20:49:27Z

@tanaes Some clarification questions:

1. You note "we may also need to add a column to the projects table", but I can't find a projects table in the labman/qiita db! Are we perchance talking about the study table? (I ask because I see that the fix for issue #32, Plates List screen (Project names), got the "project names" it displays from the qiita.study.title value.) If not, has a project table been added at some point but didn't make it into the labman db patch scripts? Or does it have a name other than projects?
2. If we are talking about the qiita.study table, is the project shortname already in there? For example, is it the study_id field or the study_alias field?
3. Right now the sample_proj parameter to _format_sample_sheet_data is being set to the sequencing process's run name. After we change sample_proj to instead hold the per-sample project shortname, does the sequencing process's run name need to be included anywhere (else) in the sample sheet output?

tanaes · 2018-03-13T20:56:35Z

I don't think that this is a field that exists in Qiita yet necessarily. CC @jdereus — the thing is we need something that can link up to a consistent folder name on Barnacle. We've had some discussions about how best to do this but I don't know that we've formally adopted a consensus yet!

So, 1 & 2: Correct, it doesn't exist yet in the Qiita db; and 3: I think it would be good to add an additional column to the sample name titled "Project_Name" that includes the long name while the protected column "Sample_Project" contains the short name.
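As a rough illustration of the proposal in point 3, the sketch below writes sample sheet data rows that carry both a short `Sample_Project` value and a long `Project_Name` value. The function name `format_data_rows`, the column order, and the example values are assumptions made for illustration; this is not the labman implementation.

```python
import csv
import io


def format_data_rows(samples):
    """Write sample rows with both project columns (hypothetical helper)."""
    columns = ["Sample_ID", "Sample_Name", "Sample_Project", "Project_Name"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    # Each dict in `samples` must provide a value for every column above.
    writer.writerows(samples)
    return buffer.getvalue()


# Made-up sample: the short name goes in Sample_Project, the long study
# title goes in the proposed Project_Name column.
rows = [
    {
        "Sample_ID": "s1",
        "Sample_Name": "s1",
        "Sample_Project": "Doe_1",
        "Project_Name": "A made-up study title",
    }
]
print(format_data_rows(rows))
```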

jdereus · 2018-03-13T21:14:28Z

We need to be very careful on the definition of "consistent folder name" and how it applies to a sample sheet. Reading some of this, it sounds like somehow we are expected to know the run folder information prior to being able to generate the sample sheet. I am not aware of a project short name other than what we add in to the sample sheet. This should be something along the lines of <PI>_<qiita_ID>. That will allow for per project, per plate data aggregation when we do the bcl conversion. Please correct me if I am misunderstanding the issue here.

JD

AmandaBirmingham · 2018-03-15T17:59:25Z

@tanaes @jdereus Here are the conclusions we reached in our discussion today; please correct anything I've gotten wrong:

- A project shortname will be generated programmatically from the study information for each sample; Labman users will not need to enter (or even know) the shortname.
- Project shortname format will be either <PI_name>_<study_ID> or <Lab_person_name>_<PI_name>_<study_ID>; Amanda will generate samples for some existing studies in each of these formats and run them by Greg for a decision on which is more usable.
- The project shortname will NOT be stored in the database anywhere. Instead, the gold-standard repository of the project shortname of any sample is the sample sheet for that sample's sequencing run. These sample sheets are expected to be saved indefinitely; as long as this is the case, it will always be possible to go back and find the project shortname used for a sample (even if the rule used to generate project shortnames from the database has changed since that sample was sequenced).
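To make the two candidate formats concrete, here is a minimal sketch in Python. The function name `make_project_shortname`, its parameters, the example names, and the choice to replace non-alphanumeric characters with underscores are all assumptions for illustration; none of them come from the labman code base.

```python
import re


def make_project_shortname(pi_name, study_id, lab_person_name=None):
    """Build a project shortname from study information (hypothetical helper).

    Returns <PI_name>_<study_ID>, or <Lab_person_name>_<PI_name>_<study_ID>
    when a lab person name is supplied.
    """
    def scrub(text):
        # Assumption: collapse anything that is not a letter or digit into
        # a single underscore so the result is usable as a folder name.
        return re.sub(r"[^A-Za-z0-9]+", "_", text.strip()).strip("_")

    parts = [scrub(pi_name), str(study_id)]
    if lab_person_name is not None:
        parts.insert(0, scrub(lab_person_name))
    return "_".join(parts)


# Made-up PI name, lab person name, and study ID.
print(make_project_shortname("Doe", 1))
print(make_project_shortname("Doe", 1, lab_person_name="Jane Roe"))
```

Because the shortname is derived on demand rather than stored, changing the rule later only affects sample sheets generated after the change, which matches the third conclusion above.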

AmandaBirmingham · 2018-03-15T18:16:52Z

@tanaes @jdereus As I look at our conclusions, I find I have one concern: are samples from any study ever sequenced more than once? Like, maybe if the original sequencing run produced bad data?? If so, the naming convention we laid out above will not keep the sequencing results of the multiple runs separate.

In the approach we chose, every sequencing run for a given study X will have the same project shortname, which could result in data from multiple runs being aggregated together into the same barnacle folder after demuxing (if folders persist between runs). Depending on how smart the naming conventions used in the demuxing code are (of which I know nothing), it is even possible that old demuxed data might be overwritten with new demuxed data.

Do these possibilities concern you? If so, how would you like to proceed?

tanaes · 2018-03-21T19:59:03Z

@AmandaBirmingham When @jdereus and I discussed this system, we envisioned that within the output project folder, each run would get its own run folder, retaining the Illumina-encoded run folder information (which includes things like the sequencing instrument serial number and flow cell serial number). That way we'd be able to maintain this per-run information, so I think it's ok to not have the sample names themselves be globally unique.
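The layout described here can be sketched as a path of the form `<base>/<project shortname>/<run folder>`. The helper name `demux_output_dir`, the base directory, and the run folder names below are made up for illustration and are not taken from the actual demultiplexing code.

```python
from pathlib import PurePosixPath


def demux_output_dir(base_dir, project_shortname, run_folder_name):
    """Return <base_dir>/<project_shortname>/<run_folder_name> (hypothetical)."""
    return PurePosixPath(base_dir) / project_shortname / run_folder_name


# Two made-up runs of the same project end up in separate run folders
# under one shared project folder, so neither overwrites the other.
first = demux_output_dir("/demux", "Doe_1", "180321_D00001_0001_AFLOWCELL1")
second = demux_output_dir("/demux", "Doe_1", "180420_D00001_0002_AFLOWCELL2")
print(first)
print(second)
```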

jdereus · 2018-03-21T20:06:24Z

I think that depends on how the data is pulled back. If, as I believe, the samples are pulled back based off of individual wells on specific plates, then it should not be an issue. But if there is some check somewhere that checks for uniqueness, then we might have an issue. I also think we might be discussing two different things here: output location after bcl conversion, and sample naming on _any_ plate and well throughout the entire system.

AmandaBirmingham · 2018-04-24T16:53:58Z

@tanaes Merrily wrote code for this and then realized an outstanding question:

What do you want the "project shortname" to be for non-experimental samples (blanks, empties, controls, etc)? I'll need to put something (if only an empty string or a None) into the array at the positions for those wells ...

tanaes · 2018-04-24T16:56:30Z

Sweet! Let me consider this and get back to you. @ElDeveloper, this might interface with the problem you're working on.

AmandaBirmingham · 2018-04-24T17:09:24Z

@tanaes Also, this question is not directly related to producing the project shortname, but I just want to double-check: what you want as the Sample_ID, Sample_Name, etc, in the shotgun sample sheet are strings that are the sample ids plus the plate and well they were plated on in the original sample plate? As in, "1_SKB1_640202_21_A1" (where the actual sample_id in qiita.study_sample is "1_SKB1_640202")?

tanaes · 2018-04-24T17:14:45Z

We don't actually need to have the plate info -- just the study + sample identifier is ok; I don't want to encode extraneous data in the filename. TBH I'd rather have a non-human-readable unique identifier but I don't think that will work in our system.

AmandaBirmingham · 2018-04-24T17:25:47Z

@tanaes Just to be clear, my question above is about the contents of the "Sample_ID" and "Sample_Name" columns in the shotgun sample sheet that LabPerson generates; as far as I know, these values aren't file names (or are you saying they are used as that, somewhere downstream)? As I said, this question strays a bit from the task of generating the project shortname; sorry :)

I just wanted to double-check that, whatever you use the sample sheet for after getting back the sequencing data, you actively want these "sample id plus position" descriptors in it rather than the actual keys that would allow you to, say, look up the sample metadata in Qiita (without having to strip off extraneous plate id and well position pieces at the end of the string). If you DO want the "sample id plus position" info (or you just don't care :) ), then all is copacetic. If actual Qiita sample ids would be more useful to you, it would be very easy to put them in the shotgun sample sheet instead of the "sample id plus position" ids.

tanaes · 2018-04-27T14:53:02Z

Sample names

Based on how the pipeline currently works, I think providing the actual Qiita Sample IDs to the format_sample_sheet machinery is going to be best. They do end up being munged into filenames (we end up encoding both the Qiita sample ID and the Illumina BCL2Fastq-compatible name in the sample sheet, and the latter is what gets prepended to the fastq filename), but we want to retain the original Qiita ID for when we rename these files later on. Currently this process is all already handled, and so I don't think
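A minimal sketch of keeping both names side by side, assuming the BCL2Fastq-compatible name is produced by replacing every character other than letters, digits, hyphens, and underscores with an underscore. That scrubbing rule, the function names, and the sample IDs are assumptions for illustration; the pipeline's actual rule may differ.

```python
import re


def bcl2fastq_compatible_name(qiita_sample_id):
    # Assumption: anything that is not a letter, digit, hyphen, or
    # underscore becomes an underscore.
    return re.sub(r"[^A-Za-z0-9_-]", "_", qiita_sample_id)


def sample_sheet_names(qiita_sample_ids):
    """Pair each original Qiita sample ID with its scrubbed name."""
    return [(sid, bcl2fastq_compatible_name(sid)) for sid in qiita_sample_ids]


# Made-up sample IDs; the original ID stays available for later renaming.
pairs = sample_sheet_names(["99.soil.A1", "99.soil-B2"])
for original, scrubbed in pairs:
    print(original, scrubbed)
```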

Freaking blanks

This is a problem that keeps rearing its head. The informal idiom we've used is that -- typically -- only a single study is included on an extraction plate. Those blanks then get inherited by that study (study-level association), with a rather inconsistent naming convention which may or may not specify well and plate number within the study.

Really I think this is best solved by having the study-level modality to the sample plating interface that @ElDeveloper and I were discussing as a way to avoid having to display the long study identifier in the window. This would allow any extraction-level controls to be unambiguously associated with a particular project, which I think should be the preferred way to do it. Anything downstream of that (e.g. leftover wells in library prep plates) I think would be ok having a 'None'-equivalent study identifier and project shortname.

An alternative would be to have a 'Controls' study that combined all of these types of samples. This is maybe not such a terrible idea, as appropriate controls for a given plate or process could in principle be queried from the database.

tanaes · 2018-05-07T16:48:33Z

OK, after chatting about this with some folks, it seems like the best option vis-a-vis the study sheet is to have any of the controls on a sequencing run end up in a 'Controls' demultiplex folder after BCL2Fastq. We don't necessarily need to make this an actual Qiita study, but that would enable a uniform place to access control samples downstream.
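The rule described here can be sketched as follows: experimental samples keep their project shortname, and everything else is assigned to 'Controls'. The function name `sample_project_values` and the `(is_experimental, shortname)` input shape are assumptions made for this sketch, not the structure labman uses.

```python
CONTROLS_PROJECT = "Controls"


def sample_project_values(wells):
    """Return one Sample_Project value per well (hypothetical helper).

    `wells` is a list of (is_experimental, project_shortname) pairs.
    """
    values = []
    for is_experimental, shortname in wells:
        if is_experimental and shortname:
            values.append(shortname)
        else:
            # Blanks, empties, and other controls share one demultiplex folder.
            values.append(CONTROLS_PROJECT)
    return values


# Made-up plate fragment: one experimental sample, one blank, one empty well.
print(sample_project_values([(True, "Doe_1"), (False, None), (False, "")]))
```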

tanaes · 2018-05-09T22:50:07Z

@AmandaBirmingham Just checking in on this because I think we're pretty dang close to being able to live test this project.

Is the only thing left here after merging 202ce44 to modify the create sample sheet code to insert this additional field?

tanaes added enhancement feature request priority:medium and removed feature request enhancement labels Mar 9, 2018

tanaes mentioned this issue Mar 9, 2018

Toplevel issue: For deploy! #211

Closed

11 tasks

AmandaBirmingham self-assigned this Mar 12, 2018

AmandaBirmingham added a commit that referenced this issue Apr 24, 2018

Tentatively fixes issue #204; not final until I get input from users …

202ce44

…on what project shortname should be for blanks/empties/etc. Currently just outputs empty string for those project shortnames.

tanaes mentioned this issue Apr 30, 2018

ENH: Prepend study identifiers to samples #226

Merged

tanaes added a commit that referenced this issue May 4, 2018

Merge pull request #222 from tanaes/fix-amplicon-pooling

a085877

Fix amplicon pooling (issue #204)

AmandaBirmingham mentioned this issue May 10, 2018

Change sample sheet Sample_ID and Sample_Name to use actual sample id instead of sample "content"? #237

Open

AmandaBirmingham added a commit that referenced this issue May 10, 2018

Fixes #204 by changing value of Sample_Project column in shotgun samp…

192a131

…le sheet to hold short projectname for each sample, or "Controls" for non-experimental samples (control, empty, blank, etc).

AmandaBirmingham mentioned this issue May 10, 2018

Issue 204 fix #238

Merged

AmandaBirmingham added a commit that referenced this issue May 10, 2018

Updated EXP_SHOTGUN_SAMPLE_SHEET to represent new expected values for…

0e4827b

… Sample_Project column based on #204 fix

tanaes closed this as completed in #238 May 15, 2018

AmandaBirmingham mentioned this issue Apr 24, 2019

Metagenomics Sample Sheet generation assigns project blank wells to a separate folder. #483

Closed
