adds pangolin 4.3.1 with new pdata 1.30 #1052

Merged
merged 2 commits into from
Sep 27, 2024

Collaborator

@kapsakcj kapsakcj commented Sep 23, 2024

Changes from previous dockerfile

  • updating to pangolin-data & pangolin-assignment 1.30
  • upgraded base image to latest micromamba-docker version
  • added the unzip -o option to overwrite existing files with the same name (like the md5sum.txt that comes with genomes downloaded using ncbi datasets). These files don't need to be kept, so I opted to overwrite them each time a .zip file is extracted.
  • added test for new lineage, MC.2

code diff:

$ diff pangolin/4.3.1-pdata-1.29/Dockerfile pangolin/4.3.1-pdata-1.30/Dockerfile 
1c1
< FROM mambaorg/micromamba:1.5.8 AS app
---
> FROM mambaorg/micromamba:1.5.10 AS app
12c12
< ARG PANGOLIN_DATA_VER="v1.29"
---
> ARG PANGOLIN_DATA_VER="v1.30"
18c18
< LABEL base.image="mambaorg/micromamba:1.5.8"
---
> LABEL base.image="mambaorg/micromamba:1.5.10"
126c126
<  unzip OQ381818.1.zip && rm OQ381818.1.zip && \
---
>  unzip -o OQ381818.1.zip && rm OQ381818.1.zip && \
135c135
< unzip OR177999.1.zip && rm OR177999.1.zip && \
---
> unzip -o OR177999.1.zip && rm OR177999.1.zip && \
144c144
< unzip OR461132.1.zip && rm OR461132.1.zip && \
---
> unzip -o OR461132.1.zip && rm OR461132.1.zip && \
153c153
< unzip OR598183.1.zip && rm OR598183.1.zip && \
---
> unzip -o OR598183.1.zip && rm OR598183.1.zip && \
164c164
< unzip OR716684.1.zip && rm OR716684.1.zip && \
---
> unzip -o OR716684.1.zip && rm OR716684.1.zip && \
174c174
< unzip PP189069.1.zip && rm PP189069.1.zip && \
---
> unzip -o PP189069.1.zip && rm PP189069.1.zip && \
185c185
< unzip PP218754.1.zip && rm PP218754.1.zip && \
---
> unzip -o PP218754.1.zip && rm PP218754.1.zip && \
195c195
< unzip PP770375.1.zip && rm PP770375.1.zip && \
---
> unzip -o PP770375.1.zip && rm PP770375.1.zip && \
204c204
< unzip PQ073669.1.zip && rm PQ073669.1.zip && \
---
> unzip -o PQ073669.1.zip && rm PQ073669.1.zip && \
208c208,217
< column -t -s, PQ073669.1-usher/lineage_report.csv
\ No newline at end of file
---
> column -t -s, PQ073669.1-usher/lineage_report.csv
> 
> # new lineage MC.2 that was introduced in pango-designation v1.30: https://github.com/cov-lineages/pango-designation/commit/c64dbc47fbfbfd7f4da011deeb1a88dd6baa45f1#diff-a121ea4b8cbeb4c0020511b5535bf24489f0223cc83511df7b8209953115d329R2564181
> # genome on NCBI: https://www.ncbi.nlm.nih.gov/nuccore/PQ034842.1
> RUN datasets download virus genome accession PQ034842.1 --filename PQ034842.1.zip && \
> unzip -o PQ034842.1.zip && rm PQ034842.1.zip && \
> mv -v ncbi_dataset/data/genomic.fna PQ034842.1.genomic.fna && \
> rm -vr ncbi_dataset/ README.md && \
> pangolin PQ034842.1.genomic.fna -o PQ034842.1-usher && \
> column -t -s, PQ034842.1-usher/lineage_report.csv

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The dockerfile successfully builds to a test target for the user creating the PR. (i.e. docker build --tag samtools:1.15test --target test docker-builds/samtools/1.15 )
  • Directory structure as name of the tool in lower case with special characters removed with a subdirectory of the version number (i.e. spades/3.12.0/Dockerfile)
    • (optional) All test files are located in same directory as the Dockerfile (i.e. shigatyper/2.0.1/test.sh)
  • Create a simple container-specific README.md in the same directory as the Dockerfile (i.e. spades/3.12.0/README.md)
    • If this README is longer than 30 lines, there is an explanation as to why more detail was needed
  • Dockerfile includes the recommended LABELS
  • Main README.md has been updated to include the tool and/or version of the dockerfile(s) in this PR
  • Program_Licenses.md contains the tool(s) used in this PR and has been updated for any missing

@kapsakcj kapsakcj marked this pull request as ready for review September 23, 2024 18:10
@kapsakcj
Collaborator Author

Does anyone have the bandwidth to review this PR?

@erinyoung
Contributor

It looks like the tests work.

#19 [test  7/17] RUN datasets download virus genome accession ON924087.1 --filename ON924087.1.zip &&  unzip ON924087.1.zip && rm ON924087.1.zip &&  mv -v ncbi_dataset/data/genomic.fna ON924087.1.genomic.fna &&  rm -vr ncbi_dataset/ README.md &&  pangolin ON924087.1.genomic.fna -o ON924087.1-usher &&  column -t -s, ON924087.1-usher/lineage_report.csv
#19 0.658 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.669 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.679 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.689 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.699 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.709 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.720 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.730 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.740 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.750 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.760 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.771 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.781 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.791 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.801 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.811 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.821 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.832 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.842 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.852 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.862 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.872 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.883 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.893 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.903 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.913 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.923 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.934 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.944 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.954 Downloading: ON924087.1.zip    847B 11.2MB/s
#19 0.964 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 0.974 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 0.985 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 0.995 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.005 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.015 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.025 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.036 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.046 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.056 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#19 1.066 Downloading: ON924087.1.zip    2.02kB 6.66kB/s
#29 12.88 ****
#29 12.88 Data files found:
#29 12.88 usher_pb:	/opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/lineageTree.pb
#29 12.88 ****
#29 12.88 ****
#29 12.88 Output file written to: /data/PQ034842.1-usher/lineage_report.csv
#29 12.99 taxon                                                                                                           lineage            conflict  ambiguity_score  scorpio_call  scorpio_support      scorpio_conflict  scorpio_notes  version                                                                    pangolin_version  scorpio_version  constellation_version  is_designated  qc_status  qc_notes  note                    
#29 12.99 "PQ034842.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/NY-CDC-LC1108001/2024   complete genome"  MC.2      0.0                            Omicron (BA.2-like)  0.92              0.02           scorpio call: Alt alleles 57; Ref alleles 1; Amb alleles 0; Oth alleles 4  PANGO-v1.30       4.3.1            0.3.19                 v0.1.12        True       pass      Ambiguous_content:0.02  Assigned from designation hash.
#29 DONE 13.0s
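
The report above is only printed by `column -t -s,`, so a wrong lineage call would not fail the build. The quoted taxon field also contains a comma, which is why the printed columns are shifted by one. Below is a minimal sketch of how the lineage could be asserted instead, assuming the report layout seen in the log (header row, taxon in column 1, lineage in column 2). `report_lineages` and `assert_lineage` are hypothetical helper names, not part of this PR or of pangolin.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): print the lineage column of a
# pangolin lineage_report.csv, one value per record.
# Assumption: header row first, taxon in column 1, lineage in column 2.
# The taxon may be double-quoted and contain commas, so it is stripped
# with sed before cut splits on commas.
report_lineages() {
  tail -n +2 "$1" |
    sed -e 's/^"[^"]*",//' -e t -e 's/^[^,]*,//' |
    cut -d, -f1
}

# Hypothetical helper: return non-zero unless every record in the report
# was assigned the expected lineage.
assert_lineage() {
  report=$1
  expected=$2
  actual=$(report_lineages "$report" | sort -u)
  if [ "$actual" = "$expected" ]; then
    echo "ok: $report -> $expected"
  else
    echo "FAIL: $report: expected '$expected', got '$actual'" >&2
    return 1
  fi
}
```

If these helpers were made available in the test stage (for example from a script copied into the image), the last step of the MC.2 test could become `assert_lineage PQ034842.1-usher/lineage_report.csv MC.2`, which would stop the build on a mismatch.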

Comment on lines +112 to +145
# download assembly for a BA.1 from Florida (https://www.ncbi.nlm.nih.gov/biosample?term=SAMN29506515 and https://www.ncbi.nlm.nih.gov/nuccore/ON924087)
# run pangolin in usher analysis mode
RUN datasets download virus genome accession ON924087.1 --filename ON924087.1.zip && \
unzip ON924087.1.zip && rm ON924087.1.zip && \
mv -v ncbi_dataset/data/genomic.fna ON924087.1.genomic.fna && \
rm -vr ncbi_dataset/ README.md && \
pangolin ON924087.1.genomic.fna -o ON924087.1-usher && \
column -t -s, ON924087.1-usher/lineage_report.csv

# test specific for new lineage, XBB.1.16, introduced in pangolin-data v1.19
# using this assembly: https://www.ncbi.nlm.nih.gov/nuccore/2440446687
# biosample here: https://www.ncbi.nlm.nih.gov/biosample?term=SAMN33060589
# one of the sample included in initial pango-designation here: https://github.com/cov-lineages/pango-designation/issues/1723
RUN datasets download virus genome accession OQ381818.1 --filename OQ381818.1.zip && \
unzip -o OQ381818.1.zip && rm OQ381818.1.zip && \
mv -v ncbi_dataset/data/genomic.fna OQ381818.1.genomic.fna && \
rm -vr ncbi_dataset/ README.md && \
pangolin OQ381818.1.genomic.fna -o OQ381818.1-usher && \
column -t -s, OQ381818.1-usher/lineage_report.csv

# testing another XBB.1.16, trying to test scorpio functionality. Want pangolin to NOT assign lineage based on pango hash match.
# this test runs as expected, uses scorpio to check for constellation of mutations, then assign using PUSHER placement
RUN datasets download virus genome accession OR177999.1 --filename OR177999.1.zip && \
unzip -o OR177999.1.zip && rm OR177999.1.zip && \
mv -v ncbi_dataset/data/genomic.fna OR177999.1.genomic.fna && \
rm -vr ncbi_dataset/ README.md && \
pangolin OR177999.1.genomic.fna -o OR177999.1-usher && \
column -t -s, OR177999.1-usher/lineage_report.csv

## test for BA.2.86
# virus identified in MI: https://www.ncbi.nlm.nih.gov/nuccore/OR461132.1
RUN datasets download virus genome accession OR461132.1 --filename OR461132.1.zip && \
unzip -o OR461132.1.zip && rm OR461132.1.zip && \
mv -v ncbi_dataset/data/genomic.fna OR461132.1.genomic.fna && \
Contributor

@erinyoung erinyoung Sep 27, 2024

I think this list of tests is getting very long, and some of what we WERE testing for isn't actually being tested in this image.

What if we did something like

# testing the following lineages:
# XBB.1.16 OR177999.1
# BA.2.86 OR461132.1
# JN.2 (BA.2.86 sublineage; JN.2 is an alias of B.1.1.529.2.86.1.2) OR598183.1 (NY CDC Quest sample)
# JQ.1 (BA.2.86.3 sublineage; JQ.1 is an alias of B.1.1.529.2.86.3.1) OR716684.1
# JN.1.22 (BA.2.86.x sublineage; full unaliased lineage is B.1.1.529.2.86.1.1.22) PP189069.1 
# JN.1.48 (BA.2.86.x sublineage; full unaliased lineage is B.1.1.529.2.86.1.1.48) PP218754.1
# and so on...
RUN datasets download virus genome accession OQ381818.1,OR177999.1,OR461132.1,OR598183.1,OR716684.1,PP189069.1,PP218754.1,PP770375.1,PQ073669.1,PQ034842.1 && \
unzip ncbi_dataset.zip && \
pangolin ncbi_dataset/data/genomic.fna && \
column -t -s, lineage_report.csv
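
With all genomes in one report, each accession could also be checked against the lineage it is meant to exercise. The following is a sketch under assumptions: the taxon is taken to start with the accession followed by a space (as in the NCBI FASTA header shown in the log), and lineage is taken to be the second CSV column. `accession_lineages` and `check_expected_lineages` are hypothetical names. The expected values in the commented usage come from comments in this thread, and only MC.2 for PQ034842.1 is confirmed by the build log.

```shell
#!/bin/sh
# Hypothetical helper: print "accession lineage" for each record of a
# combined pangolin lineage_report.csv.
# Assumption: taxon (column 1) starts with the accession, optionally
# inside double quotes; lineage is column 2.
accession_lineages() {
  tail -n +2 "$1" |
    sed -e 's/^"\([^" ]*\)[^"]*",\([^,]*\).*/\1 \2/' -e t \
        -e 's/^\([^, ]*\)[^,]*,\([^,]*\).*/\1 \2/'
}

# Hypothetical helper: read "accession expected_lineage" pairs on stdin
# and compare them with the report; returns non-zero on any mismatch or
# missing accession.
check_expected_lineages() {
  report=$1
  status=0
  while read -r accession expected; do
    # skip blank lines and comment lines
    case $accession in ''|'#'*) continue ;; esac
    actual=$(accession_lineages "$report" |
      awk -v acc="$accession" '$1 == acc { print $2; exit }')
    if [ "$actual" = "$expected" ]; then
      echo "ok: $accession -> $actual"
    else
      echo "FAIL: $accession: expected '$expected', got '${actual:-<missing>}'" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call (commented out; expected values are unverified except MC.2):
# check_expected_lineages lineage_report.csv <<'EOF'
# OR177999.1 XBB.1.16
# OR461132.1 BA.2.86
# PQ034842.1 MC.2
# EOF
```

Keeping the expected pairs in one small table keeps the intent of each test genome visible even when the individual RUN steps are collapsed into one.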

@erinyoung erinyoung merged commit 6b290e2 into master Sep 27, 2024
2 checks passed
@erinyoung
Contributor

Thank you for putting this together!

I'm going to deploy this to Docker Hub and Quay as staphb/pangolin using the tags '4.3.1-pdata-1.30' and 'latest'.

@kapsakcj kapsakcj deleted the cjk-pdata-1.30 branch November 22, 2024 15:14