update sample sheet generation to handle project shortname #204
Comments
@tanaes Some clarification questions:
|
I don't think that this is a field that exists in Qiita yet necessarily. CC @jdereus — the thing is we need something that can link up to a consistent folder name on Barnacle. We've had some discussions about how best to do this but I don't know that we've formally adopted a consensus yet! So, 1 & 2: Correct, it doesn't exist yet in the Qiita db; and 3: I think it would be good to add an additional column to the sample name titled "Project_Name" that includes the long name while the protected column "Sample_Project" contains the short name. |
We need to be very careful on the definition of "consistent folder name" and how it applies to a sample sheet. Reading some of this, it sounds like we are somehow expected to know the run folder information before we can generate the sample sheet.
I am not aware of a project short name other than what we add in to the sample sheet. This should be something along the lines of <PI>_<qiita_ID>. That will allow for per-project, per-plate data aggregation when we do the bcl conversion. Please correct me if I am misunderstanding the issue here.
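A minimal sketch of the <PI>_<qiita_ID> convention mentioned above. The function name, the example values, and the rule of stripping non-alphanumeric characters from the PI name are assumptions for illustration only; none of this is existing labman code.

```python
import re


def make_project_shortname(pi_name, qiita_id):
    """Build a '<PI>_<qiita_ID>' shortname (hypothetical helper)."""
    # Assumption: keep only letters and digits from the PI name so the
    # result can be used safely as a folder name.
    safe_pi = re.sub(r"[^A-Za-z0-9]", "", pi_name)
    if not safe_pi:
        raise ValueError("PI name has no usable characters")
    # int() rejects study ids that are not whole numbers.
    return f"{safe_pi}_{int(qiita_id)}"


# Hypothetical PI name and study id.
print(make_project_shortname("Smith", 1234))
```

Because the shortname depends only on the PI and the study id, it can be computed before the run folder exists.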
JD
(Sent by email in reply to Jon Sanders's comment above, Tuesday, March 13, 2018 at 12:56 PM.)
|
@tanaes @jdereus Here are the conclusions we reached in our discussion today; please correct anything I've gotten wrong:
|
@tanaes @jdereus As I look at our conclusions, I find I have one concern: are samples from any study ever sequenced more than once? Like, maybe if the original sequencing run produced bad data? If so, the naming convention we laid out above will not keep the sequencing results of the multiple runs separate. In the approach we chose, every sequencing run for a given study X will have the same project shortname, which could result in data from multiple runs being aggregated together into the same Barnacle folder after demuxing (if folders persist between runs). Depending on how smart the naming conventions used in the demuxing code are (of which I know nothing), it is even possible that old demuxed data might be overwritten with new demuxed data. Do these possibilities concern you? If so, how would you like to proceed? |
@AmandaBirmingham @jdereus and I, when discussing this system, envisioned that within the output project folder each run would get its own run folder, retaining the Illumina-encoded run folder information (which includes things like the sequencing instrument serial number and flow cell serial number). That way we'd be able to maintain this per-run information, so I think it's OK for the sample names themselves not to be globally unique. |
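The layout described above (a project folder containing one folder per run) might be sketched as follows. The function name, the root path, and the run folder names are made up for illustration; the run folder strings only imitate the usual Illumina date/instrument/run/flow cell pattern.

```python
from pathlib import PurePosixPath


def demux_output_dir(root, project_shortname, run_folder):
    """Return <root>/<project_shortname>/<run_folder> (assumed layout)."""
    return PurePosixPath(root) / project_shortname / run_folder


# Two hypothetical runs of the same project share a project folder
# but land in separate per-run folders.
run_a = demux_output_dir("/projects", "Smith_1234",
                         "180321_M00001_0001_000000000-AAAAA")
run_b = demux_output_dir("/projects", "Smith_1234",
                         "180402_M00001_0002_000000000-BBBBB")
print(run_a)
print(run_b)
```

Under this assumption, a re-sequenced study would not overwrite earlier demuxed data, since the run folder name differs between runs.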
I think that depends on how the data is pulled back. If, as I believe, the samples are pulled back based off of individual wells on specific plates, then it should not be an issue. But if there is some check somewhere that checks for uniqueness, then we might have an issue.
I also think we might be discussing two different things here: the output location after bcl conversion, and sample naming on _any_ plate and well throughout the entire system.
(Sent by email in reply to Jon Sanders's comment above, Wednesday, March 21, 2018 at 11:59 AM.)
|
@tanaes Merrily wrote code for this and then realized an outstanding question:
|
Sweet! Let me consider this and get back to you. @ElDeveloper, this might interface with the problem you're working on.
-j
On Tue, Apr 24, 2018 at 9:54 AM Amanda Birmingham wrote:
@tanaes Merrily wrote code for this and then realized an outstanding question:
- What do you want the "project shortname" to be for non-experimental samples (blanks, empties, controls, etc)? I'll need to put something (if only an empty string or a None) into the array at the positions for those wells ...
|
@tanaes Also, this question is not directly related to producing the project shortname, but I just want to double-check: what you want as the Sample_ID, Sample_Name, etc, in the shotgun sample sheet are strings that are the sample ids plus the plate and well they were plated on in the original sample plate? As in, "1_SKB1_640202_21_A1" (where the actual sample_id in qiita.study_sample is "1_SKB1_640202")? |
We don't actually need to have the plate info -- just the study + sample identifier is OK; I don't want to encode extraneous data in the filename. TBH I'd rather have a non-human-readable unique identifier but I don't think that will work in our system. |
@tanaes Just to be clear, my question above is about the contents of the "Sample_ID" and "Sample_Name" columns in the shotgun sample sheet that LabPerson generates; as far as I know, these values aren't file names (or are you saying they are used as that, somewhere downstream)? As I said, this question strays a bit from the task of generating the project shortname; sorry :) I just wanted to double-check that, whatever you use the sample sheet for after getting back the sequencing data, you actively want these "sample id plus position" descriptors in it rather than the actual keys that would allow you to, say, look up the sample metadata in Qiita (without having to strip off extraneous plate id and well position pieces at the end of the string). If you DO want the "sample id plus position" info (or you just don't care :) ), then all is copacetic. If actual Qiita sample ids would be more useful to you, it would be very easy to put them in the shotgun sample sheet instead of the "sample id plus position" ids. |
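For reference, recovering the Qiita sample id from a "sample id plus position" string like the example above could look like this sketch. It assumes the plate id and well are always the last two underscore-separated pieces and never contain underscores themselves; the function name is hypothetical.

```python
def split_plated_sample_id(plated_id):
    """Split '<sample_id>_<plate>_<well>' into its three parts."""
    # rsplit from the right so underscores inside the sample id survive.
    # Raises ValueError if there are fewer than two underscores.
    sample_id, plate, well = plated_id.rsplit("_", 2)
    return sample_id, plate, well


print(split_plated_sample_id("1_SKB1_640202_21_A1"))
```

Putting the actual Qiita sample ids in the sample sheet would make this stripping step unnecessary.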
…on what project shortname should be for blanks/empties/etc. Currently just outputs empty string for those project shortnames.
Based on how the pipeline currently works, I think providing the actual Qiita Sample IDs to the format_sample_sheet machinery is going to be best. They do end up being munged into filenames (we end up encoding both the Qiita sample ID and the Illumina BCL2Fastq-compatible name in the sample sheet, and the latter is what gets prepended to the fastq filename), but we want to retain the original Qiita ID for when we rename these files later on. Currently this process is all already handled, and so I don't think
This is a problem that keeps rearing its head. The informal idiom we've used is that -- typically -- only a single study is included on an extraction plate. Those blanks then get inherited by that study (study-level association), with a rather inconsistent naming convention which may or may not specify well and plate number within the study. Really I think this is best solved by having the study-level modality to the sample plating interface that @ElDeveloper and I were discussing as a way to avoid having to display the long study identifier in the window. This would allow any extraction-level controls to be unambiguously associated with a particular project, which I think should be the preferred way to do it. Anything downstream of that (e.g. leftover wells in library prep plates) I think would be OK having a 'None'-equivalent study identifier and project shortname. An alternative would be to have a 'Controls' study that combined all of these types of samples. This is maybe not such a terrible idea, as appropriate controls for a given plate or process could in principle be queried from the database. |
OK, after chatting about this with some folks, it seems like the best option vis-a-vis the study sheet is to have any of the controls on a sequencing run end up in a 'Controls' demultiplex folder after BCL2Fastq. We don't necessarily need to make this an actual Qiita study, but that would enable a uniform place to access control samples downstream. |
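A rough sketch of that rule: experimental samples keep their project shortname, and everything else is routed to 'Controls'. The sample-type labels and the function name are assumptions for illustration, not the values labman actually stores.

```python
CONTROLS_PROJECT = "Controls"


def sample_project_for(sample_type, project_shortname):
    """Pick the Sample_Project value for one well (hypothetical helper)."""
    # Assumption: anything not labeled 'experimental' (blank, empty,
    # control, ...) goes to the shared 'Controls' demultiplex folder.
    if sample_type != "experimental":
        return CONTROLS_PROJECT
    return project_shortname


wells = [("experimental", "Smith_1234"), ("blank", None), ("empty", "")]
print([sample_project_for(kind, short) for kind, short in wells])
```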
@AmandaBirmingham Just checking in on this because I think we're pretty dang close to being able to live test this project. Is the only thing left here after merging 202ce44 to modify the create sample sheet code to insert this additional field? |
…le sheet to hold short projectname for each sample, or "Controls" for non-experimental samples (control, empty, blank, etc).
… Sample_Project column based on #204 fix
We're using the sample_proj field in the Sample Sheets as a unique reference for demultiplexing to assign demuxed fastq files automatically to their project folder. We need to update the sample sheet generation to read in a per-sample value:
https://github.com/jdereus/labman/blob/0b36cfb63e8222f078f2479c9651e225864fb9c1/labman/db/process.py#L2577
And we may also need to add a column to the projects table to allow a project shortname in a format like 'PI_qiitaID' or similar.
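As an illustration of a per-sample value, the sketch below writes a reduced [Data] section in which each row carries its own Sample_Project. It is not the labman implementation linked above: the function name and inputs are hypothetical, and a real sample sheet has more columns (indexes, lanes, and so on).

```python
import csv
import io


def format_data_section(samples):
    """Render a reduced [Data] section with a per-sample Sample_Project.

    `samples` is an iterable of (sample_id, project_shortname) pairs.
    """
    out = io.StringIO()
    out.write("[Data]\n")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sample_ID", "Sample_Name", "Sample_Project"])
    for sample_id, project in samples:
        # Sample_Name repeats Sample_ID here purely for simplicity.
        writer.writerow([sample_id, sample_id, project])
    return out.getvalue()


print(format_data_section([("1_SKB1_640202", "Smith_1234"),
                           ("blank1", "Controls")]))
```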