added seqsero2 version 1.3.1 #956

Merged
merged 3 commits into from
Apr 12, 2024

erinyoung
Contributor

There's a new version of seqsero2!

New to 1.3.1:

• Converted the sequences of some alleles to their reverse complement sequences in the SeqSero2 database.
• Deleted some alleles from the SeqSero2 database because of the existence of mutations.
• Added a fliC 1,5,7 allele and a fliC 1,2,7 allele into the SeqSero2 database.
• Deleted the O54 allele.
• Fixed the bug that caused the misidentification of O9 and O2 by the micro-assembly workflow.
• Used Spades v3.15.5 to prevent the bug when running the micro-assembly workflow in Mac with M2 chip.

More information here: https://github.com/denglab/SeqSero2/releases

I did attempt to build this in jammy and focal.

The error with jammy:

14.77       building 'Bio.Align._aligners' extension
14.77       creating build/temp.linux-x86_64-3.10
14.77       creating build/temp.linux-x86_64-3.10/Bio
14.77       creating build/temp.linux-x86_64-3.10/Bio/Align
14.77       x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.10 -c Bio/Align/_aligners.c -o build/temp.linux-x86_64-3.10/Bio/Align/_aligners.o
14.77       Bio/Align/_aligners.c:11:10: fatal error: Python.h: No such file or directory
14.77          11 | #include "Python.h"
14.77             |          ^~~~~~~~~~
14.77       compilation terminated.
14.77       error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
14.77       [end of output]
14.77   
14.77   note: This error originates from a subprocess, and is likely not a problem with pip.
14.77 error: legacy-install-failure
14.77 
14.77 × Encountered error while trying to install package.
14.77 ╰─> biopython
14.77 
14.77 note: This is an issue with the package mentioned above, not pip.
14.77 hint: See above for output from the failure.

The error with focal:

1.980 Get:7 http://archive.ubuntu.com/ubuntu focal/restricted amd64 Packages [33.4 kB]
1.980 Get:8 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [177 kB]
1.980 Get:9 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages [1275 kB]
2.009 Get:10 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1489 kB]
2.037 Get:11 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [4021 kB]
2.136 Get:12 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [3634 kB]
2.206 Get:13 http://archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages [32.5 kB]
2.206 Get:14 http://archive.ubuntu.com/ubuntu focal-backports/universe amd64 Packages [28.6 kB]
2.206 Get:15 http://archive.ubuntu.com/ubuntu focal-backports/main amd64 Packages [55.2 kB]
2.484 Get:16 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [3483 kB]
2.769 Get:17 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [3546 kB]
2.937 Get:18 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [29.8 kB]
3.301 Fetched 30.9 MB in 2s (12.8 MB/s)
3.301 Reading package lists...
4.213 Reading package lists...
5.063 Building dependency tree...
5.239 Reading state information...
5.258 Package sra-toolkit is not available, but is referred to by another package.
5.258 This may mean that the package is missing, has been obsoleted, or
5.258 is only available from another source
5.258 
5.263 E: Package 'sra-toolkit' has no installation candidate

This meant that essentially all I did was mess with some tabs, add a CMD line, and update the software ARG.

Full diff between 1.2.1 and 1.3.1

$ diff seqsero2/1.2.1/Dockerfile seqsero2/1.3.1/Dockerfile 
4c4
< ARG SEQSERO2_VER="1.2.1"
---
> ARG SEQSERO2_VER="1.3.1"
6,7c6,7
< ARG SAMTOOLS_VER="1.9"
< ARG SALMID_VER="0.122"
---
> ARG SAMTOOLS_VER="1.8"
> ARG SALMID_VER="0.11"
11c11
< LABEL dockerfile.version="2"
---
> LABEL dockerfile.version="1"
13c13
< LABEL software.version="1.2.1"
---
> LABEL software.version="${SEQSERO2_VER}"
32,47c32,47
<  python3 \
<  python3-pip \
<  python3-setuptools \
<  bwa \
<  ncbi-blast+ \
<  sra-toolkit \
<  bedtools \
<  wget \
<  ca-certificates \
<  unzip \
<  zlib1g-dev \
<  libbz2-dev \
<  liblzma-dev \
<  build-essential \
<  libncurses5-dev && \
<  rm -rf /var/lib/apt/lists/* && apt-get autoclean
---
>   python3 \
>   python3-pip \
>   python3-setuptools \
>   bwa \
>   ncbi-blast+ \
>   sra-toolkit \
>   bedtools \
>   wget \
>   ca-certificates \
>   unzip \
>   zlib1g-dev \
>   libbz2-dev \
>   liblzma-dev \
>   build-essential \
>   libncurses5-dev && \
>   rm -rf /var/lib/apt/lists/* && apt-get autoclean
50,56c50,56
< RUN wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VER}/samtools-${SAMTOOLS_VER}.tar.bz2 && \
<  tar -xjf samtools-${SAMTOOLS_VER}.tar.bz2 && \
<  rm -v samtools-${SAMTOOLS_VER}.tar.bz2 && \
<  cd samtools-${SAMTOOLS_VER} && \
<  ./configure && \
<  make && \
<  make install
---
> RUN wget -q https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VER}/samtools-${SAMTOOLS_VER}.tar.bz2 && \
>   tar -xjf samtools-${SAMTOOLS_VER}.tar.bz2 && \
>   rm -v samtools-${SAMTOOLS_VER}.tar.bz2 && \
>   cd samtools-${SAMTOOLS_VER} && \
>   ./configure && \
>   make && \
>   make install
59,61c59,61
< RUN wget https://github.com/hcdenbakker/SalmID/archive/${SALMID_VER}.tar.gz && \
<  tar -xzf ${SALMID_VER}.tar.gz && \
<  rm -rvf ${SALMID_VER}.tar.gz
---
> RUN wget -q https://github.com/hcdenbakker/SalmID/archive/${SALMID_VER}.tar.gz && \
>   tar -xzf ${SALMID_VER}.tar.gz && \
>   rm -rvf ${SALMID_VER}.tar.gz
64c64
< RUN wget http://cab.spbu.ru/files/release${SPADES_VER}/SPAdes-${SPADES_VER}-Linux.tar.gz && \
---
> RUN wget -q https://github.com/ablab/spades/releases/download/v${SPADES_VER}/SPAdes-${SPADES_VER}-Linux.tar.gz && \
66c66
<   rm -rv SPAdes-${SPADES_VER}-Linux.tar.gz
---
>   rm -r SPAdes-${SPADES_VER}-Linux.tar.gz
69,74c69,74
< RUN wget https://github.com/denglab/SeqSero2/archive/v${SEQSERO2_VER}.tar.gz && \
<  tar -xzf v${SEQSERO2_VER}.tar.gz && \
<  rm -vrf v${SEQSERO2_VER}.tar.gz && \
<  cd /SeqSero2-${SEQSERO2_VER}/ && \
<  python3 -m pip install . && \
<  mkdir /data
---
> RUN wget -q https://github.com/denglab/SeqSero2/archive/v${SEQSERO2_VER}.tar.gz && \
>   tar -xzf v${SEQSERO2_VER}.tar.gz && \
>   rm -vrf v${SEQSERO2_VER}.tar.gz && \
>   cd /SeqSero2-${SEQSERO2_VER}/ && \
>   python3 -m pip install . && \
>   mkdir /data
79a80,81
> CMD SeqSero2_package.py --help
> 
87,89c89,91
< RUN wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets && \
<  chmod +x datasets && \
<  mv -v datasets /usr/local/bin
---
> RUN wget -q https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets && \
>   chmod +x datasets && \
>   mv -v datasets /usr/local/bin
111,112c113,114
< RUN wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR608/003/SRR6082043/SRR6082043_1.fastq.gz && \
<     wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR608/003/SRR6082043/SRR6082043_2.fastq.gz && \
---
> RUN wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR608/003/SRR6082043/SRR6082043_1.fastq.gz && \
>     wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR608/003/SRR6082043/SRR6082043_2.fastq.gz && \

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The dockerfile successfully builds to a test target for the user creating the PR. (i.e. docker build --tag samtools:1.15test --target test docker-builds/samtools/1.15 )
  • Directory structure as name of the tool in lower case with special characters removed with a subdirectory of the version number (i.e. spades/3.12.0/Dockerfile)
    • (optional) All test files are located in same directory as the Dockerfile (i.e. shigatyper/2.0.1/test.sh)
  • Create a simple container-specific README.md in the same directory as the Dockerfile (i.e. spades/3.12.0/README.md)
    • If this README is longer than 30 lines, there is an explanation as to why more detail was needed
  • Dockerfile includes the recommended LABELS
  • Main README.md has been updated to include the tool and/or version of the dockerfile(s) in this PR
  • Program_Licenses.md contains the tool(s) used in this PR and has been updated for any missing

@Kincekara
Collaborator

@erinyoung you can resolve the issue in Jammy by installing the python3-dev package.
There is still room for improvement in this image, but we can go ahead and skip it if you need the image soon.
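
For anyone reproducing the jammy failure: the gcc log above stops because `Python.h` cannot be found under the `-I/usr/include/python3.10` include path, which is what installing python3-dev is meant to address. A minimal shell sketch for checking this before the pip step runs; `has_python_header` is a made-up helper name, not something from the Dockerfile:

```shell
# Hypothetical helper: succeeds only if the given include directory contains Python.h.
has_python_header() {
  include_dir="$1"
  [ -f "${include_dir}/Python.h" ]
}

# Example use in a build step (path taken from the -I flag in the gcc log above):
# has_python_header /usr/include/python3.10 || echo "Python.h missing; install python3-dev"
```

A check like this only confirms that the header is present; as the rest of the thread shows, a clean install does not guarantee that the tool runs correctly.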

@erinyoung
Contributor Author

Adding python3-dev doesn't fix all the problems. Everything appears to install, but SeqSero2 doesn't function when it is actually run.

The following should be an 'Infantis'.

73.21 Sample name:      SRR6082043
73.21 Output directory: /test/SRR6082043-seqsero2-reads-allele-mode
73.21 Input files:      /test/SRR6082043_1.fastq.gz     /test/SRR6082043_2.fastq.gz
73.21 O antigen prediction:     -
73.21 H1 antigen prediction(fliC):      -
73.21 H2 antigen prediction(fljB):      -
73.21 Predicted identification: Salmonella enterica subspecies enterica (subspecies I)
73.21 Predicted antigenic profile:      -:-:-
73.21 Predicted serotype:       I -:-:-
73.21 Note:     No serotype antigens were detected. This is an atypical result that should be further investigated. 
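
A silent failure like this one is easier to catch if the test step compares the predicted serotype against the expected value instead of only checking that the command exits cleanly. A rough shell sketch, assuming the report contains a `Predicted serotype:` line as in the output above; `predicted_serotype` and `assert_serotype` are hypothetical helper names, and the `SeqSero_result.txt` file name in the usage comment is an assumption:

```shell
# Hypothetical helper: print the value of the first "Predicted serotype:" line on stdin.
# Any prefix before the label (such as docker's timing column) is dropped.
predicted_serotype() {
  sed -n '/Predicted serotype:/{s/^.*Predicted serotype:[[:space:]]*//;s/[[:space:]]*$//;p;}' | head -n 1
}

# Hypothetical helper: fail unless the report file predicts the expected serotype.
assert_serotype() {
  expected="$1"
  result_file="$2"
  actual="$(predicted_serotype < "$result_file")"
  if [ "$actual" = "$expected" ]; then
    echo "PASS: ${actual}"
  else
    echo "FAIL: expected '${expected}', got '${actual}'" >&2
    return 1
  fi
}

# Example use (output directory from the log above; file name is an assumption):
# assert_serotype Infantis /test/SRR6082043-seqsero2-reads-allele-mode/SeqSero_result.txt
```

With the jammy output shown above, this check would report `I -:-:-` instead of `Infantis` and return a non-zero status, which would stop a `RUN` step.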

@erinyoung
Contributor Author

SeqSero2 in this instance also doesn't give any clear indication of why it fails in jammy.

RUN SeqSero2_package.py --help && SeqSero2_package.py --check && SeqSero2_package.py --version && whatever:
0.424 /usr/local/bin/SeqSero2_package.py:13: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
0.424   from distutils.version import LooseVersion
0.431 usage: SeqSero2_package.py -t <data_type> -m <mode> -i <input_data> [-d <output_directory>] [-p <number of threads>] [-b <BWA_algorithm>]
0.431 
0.431 Developper: Shaokang Zhang ([email protected]), Hendrik C Den-Bakker ([email protected]) and Xiangyu Deng ([email protected])
0.431 
0.431 Contact email:[email protected]
0.431 
0.431 Version: v1.3.1
0.431 
0.431 options:
0.431   -h, --help            show this help message and exit
0.431   -i I [I ...]          <string>: path/to/input_data
0.431   -t {1,2,3,4,5}        <int>: '1' for interleaved paired-end reads, '2' for
0.431                         separated paired-end reads, '3' for single reads, '4'
0.431                         for genome assembly, '5' for nanopore reads
0.431                         (fasta/fastq)
0.431   -b {sam,mem}          <string>: algorithms for bwa mapping for allele mode;
0.431                         'mem' for mem, 'sam' for samse/sampe; default=mem;
0.431                         optional; for now we only optimized for default 'mem'
0.431                         mode
0.431   -p P                  <int>: number of threads for allele mode, if p >4,
0.431                         only 4 threads will be used for assembly since the
0.431                         amount of extracted reads is small, default=1
0.431   -m {k,a}              <string>: which workflow to apply, 'a'(raw reads
0.431                         allele micro-assembly), 'k'(raw reads and genome
0.431                         assembly k-mer), default=a
0.431   -n N                  <string>: optional, to specify a sample name in the
0.431                         report output
0.431   -d D                  <string>: optional, to specify an output directory
0.431                         name, if not set, the output directory would be
0.431                         'SeqSero_result_'+time stamp+one random number
0.431   -c                    <flag>: if '-c' was flagged, SeqSero2 will only output
0.431                         serotype prediction without the directory containing
0.431                         log files
0.431   -s                    <flag>: if '-s' was flagged, SeqSero2 will not output
0.431                         header in SeqSero_result.tsv
0.431   --phred_offset {33,64,auto}
0.431                         <33|64|auto>: offset for FASTQ file quality scores,
0.431                         default=auto
0.431   --check               <flag>: use '--check' flag to check the required
0.431                         dependencies
0.431   -v, --version         show program's version number and exit
0.468 /usr/local/bin/SeqSero2_package.py:13: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
0.468   from distutils.version import LooseVersion
0.473 Using bwa - /usr/bin/bwa
0.473 Using samtools - /usr/local/bin/samtools
0.473 Using blastn - /usr/bin/blastn
0.473 Using fastq-dump - /usr/bin/fastq-dump
0.473 Using spades.py - /SPAdes-3.15.5-Linux/bin/spades.py
0.473 Using bedtools - /usr/bin/bedtools
0.473 Using SalmID.py - /SalmID-0.11/SalmID.py
0.510 /usr/local/bin/SeqSero2_package.py:13: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
0.510   from distutils.version import LooseVersion
0.516 SeqSero2_package.py 1.3.1

@kapsakcj
Collaborator

looks like something was deprecated. Not sure what version needs to be rolled back but perhaps roll back python to a previous version? Or perhaps that distutils package specifically?

@erinyoung
Contributor Author

Or we could keep it in bionic? It seems happy there.

@kapsakcj
Collaborator

Yeah, I think that's the easiest path forward. Bionic isn't EOL until April 2028 so I say go for it

@Kincekara
Collaborator

For your information, SPAdes causes the problem: it looks for python instead of python3. I tried update-alternatives --install /usr/bin/python python /usr/bin/python3 1 and got more errors from SPAdes. Bionic seems the fastest solution for now.
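
One way to see which interpreter name a script asks for is to read its shebang line and check whether that name is on the PATH. A rough shell sketch, assuming the lookup for `python` comes from the shebang of `spades.py`, which this thread does not confirm; `shebang_interpreter` and `interpreter_available` are hypothetical helper names:

```shell
# Hypothetical helper: print the interpreter name from a script's shebang line.
# Handles both "#!/usr/bin/env NAME" and "#!/path/to/NAME".
shebang_interpreter() {
  script="$1"
  head -n 1 "$script" \
    | sed -n 's|^#![[:space:]]*||p' \
    | awk '{ if ($1 ~ /\/env$/) print $2; else { n = split($1, parts, "/"); print parts[n] } }'
}

# Hypothetical helper: succeed if the named interpreter can be found on the PATH.
interpreter_available() {
  command -v "$1" >/dev/null 2>&1
}

# Example use (spades.py path taken from the --check output earlier in this thread):
# interp="$(shebang_interpreter /SPAdes-3.15.5-Linux/bin/spades.py)"
# interpreter_available "$interp" || echo "interpreter not found: ${interp}"
```

This only reports a missing interpreter name; it would not explain the further errors seen after pointing `python` at `python3`.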

@Kincekara
Collaborator

I have no more recommendations. @kapsakcj do you want to test/review it?

@kapsakcj
Collaborator

yes, i would like to do some testing on it, please hold on merging

@kapsakcj
Collaborator

I shaved off ~7MB by adding lines to delete samtools source code since the executables get copied to /usr/local/bin with the make install command.

If we want to get fancy and save more disk space we could add a builder stage like we've done with samtools docker files, but I'd rather not since it's not too big of an image anyways. I'd prefer to get this version available sooner rather than later

I'm testing out with a small batch of samples on Terra, I'll give my approval and merge & deploy once those finish running successfully

Nice work with this one and apologies for the delay in reviewing

@kapsakcj
Collaborator

OK, my tests succeeded. Identical serotypes predicted and no hiccups while running seqsero2 👍

I'm good to merge.

In the future it would be nice to drop in some more tests with datasets of reads and assemblies with known serotypes. I only tested 4 different serotypes, so would be good to test more.
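
A broader test set along these lines could be driven by a small manifest of samples and their known serotypes. A sketch under stated assumptions: the manifest format (one `sample<TAB>expected serotype` pair per line), the `<results_dir>/<sample>/SeqSero_result.txt` layout, and the `check_manifest` name are all hypothetical, not part of this PR:

```shell
# Hypothetical helper: compare predicted serotypes against a manifest of known ones.
# Manifest format (assumed): one "sample<TAB>expected serotype" pair per line.
# Reports are assumed to live at <results_dir>/<sample>/SeqSero_result.txt.
check_manifest() {
  manifest="$1"
  results_dir="$2"
  failures=0
  while IFS="$(printf '\t')" read -r sample expected; do
    [ -n "$sample" ] || continue
    report="${results_dir}/${sample}/SeqSero_result.txt"
    actual="$(sed -n 's/^Predicted serotype:[[:space:]]*//p' "$report" 2>/dev/null | head -n 1)"
    if [ "$actual" = "$expected" ]; then
      echo "PASS ${sample}: ${actual}"
    else
      echo "FAIL ${sample}: expected '${expected}', got '${actual}'"
      failures=$((failures + 1))
    fi
  done < "$manifest"
  # Non-zero exit status if any sample disagreed with the manifest.
  [ "$failures" -eq 0 ]
}

# Example use: check_manifest known_serotypes.tsv /test/results
```

Keeping the expected values in a separate file means new reads or assemblies can be added to the test set without editing the test logic.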

@kapsakcj merged commit 50b9174 into master on Apr 12, 2024
2 checks passed
@kapsakcj deleted the erin-seqsero2 branch on April 12, 2024 12:44
@kapsakcj
Collaborator

thank you for the PR! here's the deploy workflow: https://github.com/StaPH-B/docker-builds/actions/runs/8662243597
