update sample sheet generation to handle project shortname #204

Closed
tanaes opened this issue Feb 20, 2018 · 15 comments

Comments

@tanaes
Collaborator

tanaes commented Feb 20, 2018

We're using the sample_proj field in the Sample Sheets as a unique reference for demultiplexing to assign demuxed fastq files automatically to their project folder.

We need to update the sample sheet generation to read in a per-sample value:
https://github.com/jdereus/labman/blob/0b36cfb63e8222f078f2479c9651e225864fb9c1/labman/db/process.py#L2577

And we may also need to add a column to the projects table to allow a project shortname in a format like 'PI_qiitaID' or similar.

@AmandaBirmingham
Collaborator

@tanaes Some clarification questions:

  1. You note "we may also need to add a column to the projects table", but I can't find a projects table in the labman/qiita db! Are we perchance talking about the study table? (I ask because I see that the fix for issue #32, Plates List screen (Project names), got the "project names" it displays from the qiita.study.title value.) If not, has a project table been added at some point but didn't make it into the labman db patch scripts? Or does it have a name other than projects?
  2. If we are talking about the qiita.study table, is the project shortname already in there? For example, is it the study_id field or the study_alias field?
  3. Right now the sample_proj parameter to _format_sample_sheet_data is being set to the sequencing process's run name. After we change sample_proj to instead hold the per-sample project shortname, does the sequencing process's run name need to be included anywhere (else) in the sample sheet output?

@tanaes
Collaborator Author

tanaes commented Mar 13, 2018

I don't think that this is a field that exists in Qiita yet necessarily. CC @jdereus — the thing is we need something that can link up to a consistent folder name on Barnacle. We've had some discussions about how best to do this but I don't know that we've formally adopted a consensus yet!

So, 1 & 2: Correct, it doesn't exist yet in the Qiita db; and 3: I think it would be good to add an additional column to the sample sheet titled "Project_Name" that includes the long name while the protected column "Sample_Project" contains the short name.

@jdereus
Collaborator

jdereus commented Mar 13, 2018 via email

@AmandaBirmingham
Collaborator

AmandaBirmingham commented Mar 15, 2018

@tanaes @jdereus Here are the conclusions we reached in our discussion today; please correct anything I've gotten wrong:

  • A project shortname will be generated programmatically from the study information for each sample; Labman users will not need to enter (or even know) the shortname.
  • Project shortname format will be either <PI_name>_<study_ID> or <Lab_person_name>_<PI_name>_<study_ID> ; Amanda will generate samples for some existing studies in each of these formats and run them by Greg for a decision on which is more usable.
  • The project shortname will NOT be stored in the database anywhere. Instead, the gold-standard repository of the project shortname of any sample is the sample sheet for that sample's sequencing run. These sample sheets are expected to be saved indefinitely; as long as this is the case, it will always be possible to go back and find the project shortname used for a sample (even if the rule used to generate project shortnames from the database has changed since that sample was sequenced).
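
As a rough illustration of the two candidate formats listed above, here is a minimal Python sketch. The function name `project_shortname`, its parameters, and the rule of stripping non-alphanumeric characters from each part are all assumptions made for this example; the thread does not specify how names are cleaned or which format was finally chosen.

```python
import re


def project_shortname(pi_name, study_id, lab_person_name=None):
    """Hypothetical helper: build <PI_name>_<study_ID>, or
    <Lab_person_name>_<PI_name>_<study_ID> when a lab person is given."""
    parts = [pi_name, str(study_id)]
    if lab_person_name is not None:
        parts.insert(0, lab_person_name)
    # Assumption: drop anything that is not a letter or digit so the
    # result is safe to use as a folder name.
    cleaned = [re.sub(r"[^0-9A-Za-z]+", "", part) for part in parts]
    return "_".join(cleaned)


print(project_shortname("Smith", 101))              # Smith_101
print(project_shortname("Smith", 101, "Jane Doe"))  # JaneDoe_Smith_101
```

Because the shortname is derived rather than stored, a later change to this rule would change the output for old studies, which is why the saved sample sheet is treated as the record of what was actually used.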

@AmandaBirmingham
Collaborator

@tanaes @jdereus As I look at our conclusions, I find I have one concern: are samples from any study ever sequenced more than once? Like, maybe if the original sequencing run produced bad data?? If so, the naming convention we laid out above will not keep the sequencing results of the multiple runs separate.

In the approach we chose, every sequencing run for a given study X will have the same project shortname, which could result in data from multiple runs being aggregated together into the same Barnacle folder after demuxing (if folders persist between runs). Depending on how smart the naming conventions used in the demuxing code are (of which I know nothing), it is even possible that old demuxed data might be overwritten with new demuxed data.

Do these possibilities concern you? If so, how would you like to proceed?

@tanaes
Collaborator Author

tanaes commented Mar 21, 2018

@AmandaBirmingham When @jdereus and I discussed this system, we envisioned that within the output project folder, each run would get its own run folder, retaining the Illumina-encoded run folder information (which includes things like the sequencing instrument serial number and flow cell serial number). That way we'd be able to maintain this per-run information, so I think it's ok to not have the sample names themselves be globally unique.
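
The layout described here (a run folder nested inside each project folder) can be sketched as below. This is only an illustration: the helper `demux_output_dir`, the base directory, and the example run folder names are hypothetical and are not taken from the actual demultiplexing code.

```python
from pathlib import PurePosixPath


def demux_output_dir(base_dir, project_shortname, run_folder):
    """Hypothetical helper: nest each run's output under its project folder."""
    return PurePosixPath(base_dir) / project_shortname / run_folder


# Two made-up runs of the same study land in separate directories,
# even though they share one project shortname.
first = demux_output_dir("/demux", "Smith_101", "180321_M00001_0042_000000000-ABCDE")
second = demux_output_dir("/demux", "Smith_101", "180402_M00001_0043_000000000-FGHIJ")
print(first)
print(second)
```

With this nesting, a repeated sample name only collides if the same run folder is demultiplexed twice.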

@jdereus
Collaborator

jdereus commented Mar 21, 2018 via email

@AmandaBirmingham
Collaborator

@tanaes Merrily wrote code for this and then realized an outstanding question:

  • What do you want the "project shortname" to be for non-experimental samples (blanks, empties, controls, etc)? I'll need to put something (if only an empty string or a None) into the array at the positions for those wells ...

@tanaes
Collaborator Author

tanaes commented Apr 24, 2018 via email

@AmandaBirmingham
Collaborator

@tanaes Also, this question is not directly related to producing the project shortname, but I just want to double-check: what you want as the Sample_ID, Sample_Name, etc, in the shotgun sample sheet are strings that are the sample ids plus the plate and well they were plated on in the original sample plate? As in, "1_SKB1_640202_21_A1" (where the actual sample_id in qiita.study_sample is "1_SKB1_640202")?

@tanaes
Collaborator Author

tanaes commented Apr 24, 2018

We don't actually need to have the plate info -- just the study + sample identifier is ok, I don't want to encode extraneous data in the filename. TBH I'd rather have a non-human-readable unique identifier but I don't think that will work in our system.

@AmandaBirmingham
Collaborator

@tanaes Just to be clear, my question above is about the contents of the "Sample_ID" and "Sample_Name" columns in the shotgun sample sheet that LabPerson generates; as far as I know, these values aren't file names (or are you saying they are used as that, somewhere downstream)? As I said, this question strays a bit from the task of generating the project shortname; sorry :)

I just wanted to double-check that, whatever you use the sample sheet for after getting back the sequencing data, you actively want these "sample id plus position" descriptors in it rather than the actual keys that would allow you to, say, look up the sample metadata in Qiita (without having to strip off extraneous plate id and well position pieces at the end of the string). If you DO want the "sample id plus position" info (or you just don't care :) ), then all is copacetic. If actual Qiita sample ids would be more useful to you, it would be very easy to put them in the shotgun sample sheet instead of the "sample id plus position" ids.

AmandaBirmingham added a commit that referenced this issue Apr 24, 2018
…on what project shortname should be for blanks/empties/etc. Currently just outputs empty string for those project shortnames.
@tanaes
Collaborator Author

tanaes commented Apr 27, 2018

  1. Sample names

Based on how the pipeline currently works, I think providing the actual Qiita Sample IDs to the format_sample_sheet machinery is going to be best. They do end up being munged into filenames (we end up encoding both the Qiita sample ID and the Illumina BCL2Fastq-compatible name in the sample sheet, and the latter is what gets prepended to the fastq filename), but we want to retain the original Qiita ID for when we rename these files later on. Currently this process is all already handled, and so I don't think

  2. Freaking blanks

This is a problem that keeps rearing its head. The informal idiom we've used is that -- typically -- only a single study is included on an extraction plate. Those blanks then get inherited by that study (study-level association), with a rather inconsistent naming convention which may or may not specify well and plate number within the study.

Really I think this is best solved by having the study-level modality to the sample plating interface that @ElDeveloper and I were discussing as a way to avoid having to display the long study identifier in the window. This would allow any extraction-level controls to be unambiguously associated with a particular project, which I think should be the preferred way to do it. Anything downstream of that (e.g. leftover wells in library prep plates) I think would be ok having a 'None'-equivalent study identifier and project shortname.

An alternative would be to have a 'Controls' study that combined all of these types of samples. This is maybe not such a terrible idea, as appropriate controls for a given plate or process could in principle be queried from the database.

tanaes added a commit that referenced this issue May 4, 2018
@tanaes
Collaborator Author

tanaes commented May 7, 2018

OK, after chatting about this with some folks, it seems like the best option vis-a-vis the study sheet is to have any of the controls on a sequencing run end up in a 'Controls' demultiplex folder after BCL2Fastq. We don't necessarily need to make this an actual Qiita study, but that would enable a uniform place to access control samples downstream.
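
A minimal sketch of that rule, assuming the per-well project shortnames arrive as a list in which non-experimental wells hold `None` or an empty string. The names `CONTROLS_PROJECT` and `sample_project_column` are hypothetical and do not come from the labman code.

```python
CONTROLS_PROJECT = "Controls"


def sample_project_column(shortnames):
    """Hypothetical helper: return Sample_Project values, sending wells
    with no project shortname (blanks, empties, controls) to 'Controls'."""
    return [name if name else CONTROLS_PROJECT for name in shortnames]


wells = ["Smith_101", None, "", "Jones_202"]
print(sample_project_column(wells))
# ['Smith_101', 'Controls', 'Controls', 'Jones_202']
```

Since Sample_Project decides the demultiplex folder, every control on the run would then end up together in a single 'Controls' folder.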

@tanaes
Collaborator Author

tanaes commented May 9, 2018

@AmandaBirmingham Just checking in on this because I think we're pretty dang close to being able to live test this project.

Is the only thing left here after merging 202ce44 to modify the create sample sheet code to insert this additional field?

AmandaBirmingham added a commit that referenced this issue May 10, 2018
…le sheet to hold short projectname for each sample, or "Controls" for non-experimental samples (control, empty, blank, etc).
AmandaBirmingham added a commit that referenced this issue May 10, 2018