IPVC-2344: Update SeqRepo load to discover local fasta files #26

Merged
bsgiles73 merged 3 commits into main from IPVC-2344-pass-fasta-files-to-seqrepo-load
Apr 25, 2024

@bsgiles73 bsgiles73 commented Apr 19, 2024

This PR adds logic to the sbin/uta-extract script to discover fasta files rather than having them hardcoded. The discovered files are copied into a single working directory.
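
As an illustration only, the discover-and-copy step described above could be sketched in Python as follows. This is not the code in sbin/uta-extract; the function names `discover_fasta_files` and `copy_to_working_dir` are hypothetical, and the `.fna.gz` / `.faa.gz` suffixes are an assumption based on the file names in the artifact listing further down.

```python
import shutil
from pathlib import Path

# Assumption: fasta files are recognized by these suffixes, as seen in the
# artifact listing (e.g. human.test.rna.fna.gz, human.test.protein.faa.gz).
FASTA_SUFFIXES = (".fna.gz", ".faa.gz")


def discover_fasta_files(root):
    """Return all fasta files found anywhere under root, in sorted order."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )


def copy_to_working_dir(files, working_dir):
    """Copy files into one flat directory and return the destination paths."""
    working_dir = Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    seen = set()
    for src in files:
        # Flattening drops the directory structure, so two sources with the
        # same file name would silently overwrite each other. Fail instead.
        if src.name in seen:
            raise FileExistsError(f"duplicate file name: {src.name}")
        seen.add(src.name)
        dest = working_dir / src.name
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Under these assumptions, a call such as `copy_to_working_dir(discover_fasta_files("/ncbi-dir"), "output/artifacts")` would gather every matching file into one directory, so a later load step only needs to look in a single place.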

While testing this change I found an issue with our Docker image: tabix was not installed, so sequences could not be added to SeqRepo. The Dockerfile was updated to install it.
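
A missing tool like this can be caught early with a small pre-flight check. The sketch below is an assumption about how such a check might look, not part of this PR; `missing_tools` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil


def missing_tools(tools):
    """Return the subset of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def require_tools(tools):
    """Raise a clear error before any work starts if a tool is absent."""
    missing = missing_tools(tools)
    if missing:
        raise RuntimeError("required tools not installed: " + ", ".join(missing))
```

Calling `require_tools(["tabix"])` at the start of the load step would turn the failure into an immediate, readable error instead of one that surfaces partway through adding sequences.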

To test, I ran the following:

sgiles-MD6M:uta shane.giles$ docker compose run uta-extract
WARN[0000] The "UTA_ETL_SKIP_GENE_LOAD" variable is not set. Defaulting to a blank string.
/opt/repos/uta/sbin/ncbi-parse-gbff:33: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
2024-04-24 17:57:08 INFO     [__main__] opened /ncbi-dir/refseq/H_sapiens/mRNA_Prot/human.test.rna.gbff.gz
/usr/local/lib/python3.10/dist-packages/Bio/GenBank/Scanner.py:1217: BiopythonParserWarning: Premature end of file in sequence data
  warnings.warn(
2024-04-24 17:59:19 INFO     [__main__] 642 genes in /ncbi-dir/refseq/H_sapiens/mRNA_Prot/human.test.rna.gbff.gz (Counter({'NM': 1384, 'NR': 380}))
2024-04-24 17:59:19 INFO     [__main__] 642 genes in 1 files (Counter({'NM': 1384, 'NR': 380}))
/opt/repos/uta/sbin/ncbi_parse_genomic_gff.py:33: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
2024-04-24 17:59:20 INFO     [__main__] read 1776 transcript alignments from file(s): /ncbi-dir/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_000853.3 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NR_003491.3 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NR_033319.2 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NR_033320.2 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NR_033321.2 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NR_156186.2 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_001284286.1 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_001284289.1 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_001284288.1 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_001002837.2 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_001284287.1 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 WARNING  [__main__] Exon set transcript NM_001350317.2 not found in txinfo file. Filtering out.
2024-04-24 17:59:21 INFO     [__main__] Filtered out exon sets for 12 transcript(s): NM_001350317.2,NR_003491.3,NM_001002837.2,NM_000853.3,NR_033320.2,NM_001284289.1,NR_033319.2,NM_001284287.1,NR_033321.2,NM_001284288.1,NR_156186.2,NM_001284286.1

I verified that the fasta files were copied to the working directory:

sgiles-MD6M:uta shane.giles$ cd output/artifacts/
sgiles-MD6M:artifacts shane.giles$ ll
total 29224
-rw-r--r--  1 shane.giles  staff  11122313 Apr 24 11:59 GCF_000001405.25_GRCh37.p13_genomic.fna.gz
-rw-r--r--  1 shane.giles  staff    307826 Apr 24 11:59 GCF_000001405.25_GRCh37.p13_protein.faa.gz
-rw-r--r--  1 shane.giles  staff   1330779 Apr 24 11:59 GCF_000001405.25_GRCh37.p13_rna.fna.gz
-rw-r--r--  1 shane.giles  staff     19252 Apr 24 11:57 assocacs.gz
-rw-r--r--  1 shane.giles  staff     60177 Apr 24 11:59 exonsets.gz
-rw-r--r--  1 shane.giles  staff       174 Apr 24 11:59 filtered_tx_acs.txt
-rw-r--r--  1 shane.giles  staff     33216 Apr 24 11:57 geneinfo.gz
-rw-r--r--  1 shane.giles  staff    301989 Apr 24 11:59 human.test.protein.faa.gz
-rw-r--r--  1 shane.giles  staff   1623010 Apr 24 11:59 human.test.rna.fna.gz
drwxr-xr-x  7 shane.giles  staff       224 Apr 17 16:35 splign-manual
-rw-r--r--  1 shane.giles  staff     79638 Apr 24 11:59 txinfo.gz
-rw-r--r--  1 shane.giles  staff     60481 Apr 24 11:59 unfiltered_exonsets.gz

I was then able to run seqrepo-load successfully:

sgiles-MD6M:uta shane.giles$ docker compose run seqrepo-load
WARN[0000] The "UTA_ETL_SKIP_GENE_LOAD" variable is not set. Defaulting to a blank string.
human.test.protein.faa.gz: 100%|██████████| 5/5 [00:20<00:00,  4.17s/file]

…ment file to set SeqRepo version, remove cp from uta-extract
@bsgiles73 bsgiles73 changed the title IPVC-2344: Update SeqRepo load to discover fasta files IPVC-2344: Update SeqRepo load to discover local fasta files Apr 19, 2024
@bsgiles73 bsgiles73 requested a review from nvta1209 April 24, 2024 18:16
@bsgiles73 bsgiles73 merged commit 8ff7464 into main Apr 25, 2024
1 check passed
@bsgiles73 bsgiles73 deleted the IPVC-2344-pass-fasta-files-to-seqrepo-load branch April 25, 2024 22:00