Here, the changes to genomepy will be summarized.
The format is based on Keep a Changelog.
- fix for NCBI's assembly report header "asm_submitter" instead of "submitter"
0.16.0 - 2023-05-31
- genomepy search now accepts the --exact flag
- genomepy.Annotation.attributes() returns a list of all attributes from the GTF attributes column.
- e.g. gene_name, gene_version
- nice to use with genomepy.Annotation.from_attributes() or genomepy.Annotation.gtf_dict()
- When installing assemblies from older Ensembl release versions, a clearer error message is given if the assembly cannot be found:
- if the release does not exist, options will be given
- if the assembly does not exist on the release version, all available options are given
- if the URL to the genome or annotation files is incorrect, the error message stays the same
- new config option: ucsc_mirror, options: eu or us.
- the mirror should only affect download speed
- can be nice if the other mirror is down!
- function get_division is now a class method of EnsemblProvider
- EnsemblProvider class methods get_division and get_version now require an assembly name.
- UCSC data is now downloaded over HTTPS instead of HTTP
- genomepy.install() now returns a Genome instance with updated annotation attributes.
- now ignoring ~1600 assemblies from the Ensembl database with incorrect metadata
- no easy way to retrieve this data
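
To illustrate what a list of GTF attributes looks like: the ninth GTF column holds `key "value";` pairs, and the attribute names are the keys. The sketch below collects them with the standard library only. It is not genomepy's implementation; `list_gtf_attributes` and the example records are hypothetical.

```python
import re

# Matches one `key "value";` pair in the GTF attributes column.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def list_gtf_attributes(gtf_lines):
    """Return the sorted, unique attribute names found in the 9th GTF column."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        for key, _value in ATTRIBUTE_PATTERN.findall(columns[8]):
            names.add(key)
    return sorted(names)


# Hypothetical records, for illustration only.
example = [
    "#!genome-build example",
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "abc"; gene_version "2";',
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(list_gtf_attributes(example))
```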
0.15.0 - 2023-02-28
- you can now tune the cache expiration time in the config
- create a config with genomepy config generate, then tweak the values as desired.
- support for biopython >=1.80 with pyfaidx update
- raise an informative error when UCSC tools are missing
- this should only happen in Pip installations
- disabling already disabled plugins no longer throws an error
- bgzipping fixes:
- bgzip works again with python>3.7 (openssl shenanigans. tabix was deprecated for htslib)
- genome index works with genome install --bgzip (a 2nd is created with the correct naming format)
- export file works with genome install --bgzip
- genomepy.install_genome(bgzip=True) returns a Genome class instance with correct paths
0.14.0 - 2022-08-01
- now using filelock for improved thread safety
- now checking if every API/FTP/HTTP(S) is accessible before proceeding
- genomepy search improvements:
- text search now accepts regex, and multiple substrings (space separated) are unordered.
- taxonomy search now returns all hits that start with the given number.
- switched to pyproject.toml + hatchling for packaging
- updated the README and CLI documentation to mention the Local provider
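
The two search improvements in 0.14.0 can be pictured with a small standalone sketch: each space-separated term is treated as a regex and must occur somewhere in the text, in any order, and a taxonomy query matches every ID that starts with the given digits. This is not genomepy's code. The helper names, the case-insensitive matching and the tiny catalogue are assumptions made for illustration.

```python
import re


def text_matches(query, description):
    """True if every space-separated term (a regex) occurs in the description, in any order."""
    # Assumption: matching is case-insensitive.
    return all(
        re.search(term, description, flags=re.IGNORECASE) for term in query.split()
    )


def taxonomy_matches(query, tax_id):
    """True if the taxonomy ID starts with the digits given in the query."""
    return str(tax_id).startswith(str(query))


# Hypothetical catalogue, for illustration only.
genomes = {
    "GRCh38.p13": {"species": "Homo sapiens", "tax_id": 9606},
    "GRCm39": {"species": "Mus musculus", "tax_id": 10090},
    "danRer11": {"species": "Danio rerio", "tax_id": 7955},
}

print([n for n, g in genomes.items() if text_matches("sapiens homo", g["species"])])
print([n for n, g in genomes.items() if text_matches("^mus", g["species"])])
print([n for n, g in genomes.items() if taxonomy_matches(96, g["tax_id"])])
```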
0.13.1 - 2022-06-21
- removed unused keys from Ensembl and UCSC databases to reduce their size
- added a retry for initializing the diskcache (seq2science/issues/887)
- can now find Ensembl URLs for genomes not using url_names properly (#205)
0.13.0 - 2022-06-02
- genomepy search and genomepy genomes can now return the (unfiltered) absolute genome size with argument --size
- changed caching backend to diskcache (thread safe)
- reduced the local cache size of NCBI (by about half)
- by only storing assembly summary columns actually used by genomepy
0.12.0 - 2022-03-28
- genomepy.Annotation.lengths() to retrieve the gene/transcript lengths.
- genomepy.Annotation.from_attributes() can extract any sub-column from that pesky attributes column
- updated Boyle-lab blacklists
- genomepy.Annotation.genes() default changed from bed (commonly containing transcript names) to gtf (gene names)
- blacklists now work with GENCODE
- query_mygene no longer filters input.
- genomepy install with local provider now understands you want the annotation if you pass a path to an annotation
0.11.1 - 2022-01-06
- quiet flag for genomepy.Annotation
- genomepy -v flag
- genomepy.Annotation returns a FileNotFoundError instead of a ValueError where appropriate.
- download_assembly_report refactored. Now downloads the report for the exact same assembly accession (and not the nearest NCBI assembly).
- broader unit tests for UCSC assembly accession scraping
- inconsistent behaviour with assembly reports (#193 + #194)
0.11.0 - 2021-11-18
- extended docstrings
- GENCODE support (GENCODE gene annotations with UCSC genomes)
- only contains the main chromosomes, no scaffolds or alternate haplotypes.
- only contains 4 assemblies (2 mouse, 2 human)
- excellent annotations for these regions & species though!
- Ensembl's GRCh37 can now be downloaded through genomepy
- Local fasta/gtf/gff(3)/bed file support
- you can install a local genome and/or annotation by providing local path(s) to genomepy install
- if annotation downloading is requested, but no annotation path is provided, a gtf/gff(3) annotation will be sought in the genome's source directory.
- Annotation.gtf_dict creates a dictionary for any key-value pair in the GTF columns or attribute fields!
- e.g. Annotation.gtf_dict("seqname", "gene_name")
- Genome.track2fasta can now ignore comment lines (starting with #)
- Genome.track2fasta will skip header lines (a warning will be printed)
- Genome.track2fasta will ignore regions that cannot be parsed (a warning will be printed)
- these fixes should improve gimme scan performance and feedback
- UCSC annotation conversion tool settings tweaked. Better results with source gff files.
- Ensembl now uses HTTP instead of FTP (in some cases). This improves stability on some servers.
- tweaked search result alignment for clarity
- explained UCSC annotations in the README
- better file path handling (relative paths, user home and variables are expanded)
- Annotation now accepts a file/directory/genomepy name as first argument.
- this merges 2 arguments into one.
- Annotation.map_genes now works without a README file
- you can now set Annotation.tax_id manually.
- Ensembl annotations from previous releases can now be downloaded as intended.
- Genome.track2fasta will skip regions that clearly don't make sense (start>end, and start<0)
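
The track2fasta entries in 0.11.0 describe a filtering policy: ignore comment lines, warn about lines that cannot be parsed (such as headers), and skip regions with start>end or start<0. A minimal sketch of that policy for `chr:start-end` region lines follows. It is not genomepy's implementation; `parse_regions`, the pattern and the example lines are hypothetical.

```python
import re
import warnings

# One region per line, written as chrom:start-end (a sign is allowed so that
# negative starts can be detected and rejected explicitly).
REGION_PATTERN = re.compile(r"^(\S+):(-?\d+)-(-?\d+)$")


def parse_regions(lines):
    """Yield (chrom, start, end) for usable lines, warning about skipped ones."""
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # comment or blank line: ignore silently
        match = REGION_PATTERN.match(text)
        if match is None:
            warnings.warn(f"skipping unparsable line: {text!r}")
            continue
        chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        if start < 0 or start > end:
            warnings.warn(f"skipping nonsensical region: {text!r}")
            continue
        yield chrom, start, end


# Hypothetical input, for illustration only.
example_lines = [
    "# a comment",
    "chrom\tstart\tend",  # header line: cannot be parsed
    "chr1:10-20",
    "chr1:100-50",  # start > end
    "chr1:-5-10",  # start < 0
]
print(list(parse_regions(example_lines)))
```

Using a generator keeps the filter lazy, so a large regions file does not need to be held in memory.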
0.10.0 - 2021-07-30
- Annotation class, containing
- regex filter (genomepy.Annotation.filter_regex())
- sanitize functions (genomepy.Annotation.sanitize())
- option to skip filtering and/or matching the annotation to the genome (also on CLI)
- gene name remapping to various formats (genomepy.Annotation.map_genes())
- using MyGene.info. Can be queried separately (genomepy.annotation.query_mygene())
- contig name remapping to other provider formats (genomepy.Annotation.map_locations())
- get the annotations, or gene locations, as dataframes (genomepy.Annotation.gtf, bed or gene_coords() respectively)
- get the gene names as a list (genomepy.Annotation.genes("gtf") or genomepy.Annotation.genes("bed"))
- genomepy install now attempts to install the NCBI assembly report
- NCBI provider also indexes the NCBI genbank_historical summary
- genomepy search now shows if the genome has an annotation
- this slows down the results a bit
- to compensate, results are now shown as soon as they are found
- for UCSC, availability of any of the 4 annotations is shown
- genomepy annotation shows the first line(s) of each gene annotation.gtf
- for developers:
- pre-commit-hooks for linting
- formatting/linting script tests/format.sh (optional argument lint)
- isort & autoflake formatters
- provider module split per provider
- ProviderBase overhauled, now called Provider
- regex filtering separated from
Provider.download_genome
- utils module split into utils, files and online
- now using loguru for pretty logging
- accession search improved
- now finds GCA and GCF accessions
- now ignores patch levels
- genomepy install automatic provider selection refactored
- Provider.online_providers returns a generator (faster!)
- genomepy install uses a combined filter function (faster!)
- genomepy install only zips annotation files if the genome is zipped (with the bgzip flag) (faster!)
- NCBI provider should be parsed faster (faster!)
- new dependency: pandas
- tests no longer format code
- broken URLs should keep genomepy occupied for less long (check_url will immediately return on "Not Found" errors 404/450) (faster!)
- the Genome class now passes arguments to the parent Fasta class
- the Genome class now regenerates the sizes and gaps files similarly to the Fasta class and its index (when the genome is younger) (faster!)
- somewhat more pythonic tests
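
One plausible reading of the accession search entries in 0.10.0 is that a GenBank (GCA) and a RefSeq (GCF) accession are compared on their shared digits, with the version suffix dropped. The sketch below shows that idea only. It is an assumption about the behaviour, not genomepy's code, and `accession_key` and `same_assembly` are hypothetical helpers.

```python
import re

# GCA = GenBank, GCF = RefSeq; the nine digits identify the assembly and the
# optional ".N" suffix is its version.
ACCESSION_PATTERN = re.compile(r"^GC[AF]_(\d{9})(?:\.\d+)?$", flags=re.IGNORECASE)


def accession_key(accession):
    """Return the 9-digit core of a GCA/GCF accession, or None if it is not one."""
    match = ACCESSION_PATTERN.match(accession.strip())
    return match.group(1) if match else None


def same_assembly(query, candidate):
    """True if both accessions share the same digits, ignoring prefix and version."""
    key = accession_key(query)
    return key is not None and key == accession_key(candidate)


print(accession_key("GCA_000001405.15"))
print(same_assembly("GCA_000001405.15", "GCF_000001405.39"))
print(same_assembly("GCA_000001405", "GCA_000001635.9"))
```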
0.9.3 - 2021-02-03
- URL provider got better at searching for annotation files
- NCBI provider will fall back on FTP if HTTPS is offline
- genomes from ftp locations not working
0.9.2 - 2021-01-28
- progress bars for downloading and bgzipping (the slow stuff)
- spinner to indexing plugins (the slowest stuff)
- removed dependency of psutils
- added dependency of tqdm
- an oopsie in the regex filter functions slowing down install.
- rm_rf and mkdir_p to behave more like their namesakes.
0.9.1 - 2020-10-26
- genomepy install flag -k/--keep-alt to keep alternative regions
- argparse custom type for a genome command line argument
- added retries to UCSC and NCBI
- added retries to Travis tests
- Bucketcache improvements
- genomepy search keeps searching after an exact match is found
- genomepy install removes alternative regions by default
- genomepy clean won't complain when there is nothing to clean
- properly gzip the annotation.gtf if it was unzipped during sanitizing
- genomepy install can use the URL provider again
- genomepy install with -f/--force will overwrite previous sizes and gaps files
0.9.0 - 2020-09-01
- check to see if providers are online + error message if not
- automatic provider selection for genomepy install
- optional provider flag for genomepy install (-p/--provider)
- if no provider is passed to genomepy install, the first provider with the genome is used (order: Ensembl > UCSC > NCBI).
- genomepy clean removes local caches. Will be reloaded when required.
- Ensembl genomes always download over ftp (http was too unstable)
- Ensembl release versions obtained via REST API (http was too unstable)
- genomepy search and genomepy providers only check online providers
- Online functions now have a timeout and a retry system
- API changes to download_genome and download_annotation for consistency
- Ensembl status check uses lighter url (more stable)
- search and install now consistently use safe search terms (no spaces)
- search now uses UTF-8, no longer crashing for \u2019 (some quotation mark).
- search case insensitivity fixed for assembly names.
- Bucketcache now stores less data, increasing responsiveness.
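
The provider selection rule from 0.9.0 (use the requested provider if one is given, otherwise the first of Ensembl > UCSC > NCBI that has the genome) can be sketched as below. This only illustrates the stated order and is not genomepy's implementation; `select_provider` and the `catalogues` mapping are hypothetical.

```python
# Search order stated in the 0.9.0 notes.
PROVIDER_ORDER = ("Ensembl", "UCSC", "NCBI")


def select_provider(genome_name, catalogues, requested=None):
    """Pick the requested provider, or the first in PROVIDER_ORDER listing the genome."""
    candidates = (requested,) if requested else PROVIDER_ORDER
    for provider in candidates:
        if genome_name in catalogues.get(provider, ()):
            return provider
    raise ValueError(f"{genome_name!r} not found in: {', '.join(candidates)}")


# Hypothetical catalogues (provider -> genome names), for illustration only.
catalogues = {
    "Ensembl": {"GRCh38.p13"},
    "UCSC": {"hg38", "danRer11"},
    "NCBI": {"GRCh38.p13", "danRer11"},
}

print(select_provider("danRer11", catalogues))
print(select_provider("GRCh38.p13", catalogues, requested="NCBI"))
```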
0.8.4 - 2020-07-29
- Fix bug where Genome.sizes dict contains str instead of int (#110).
- Fix bug with UTF-8 in README (#109).
- Fix bug where BED files with chr:start-end in 4th column are not recognized as BED files.
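
The Genome.sizes fix in 0.8.4 concerns a common pitfall: values read from a tab-separated sizes file are strings unless converted. A minimal sketch of reading such a file into integers follows. The two-column layout (name, length) and the `read_sizes` helper are assumptions for illustration, not genomepy's code.

```python
def read_sizes(lines):
    """Parse 'name<TAB>length' lines into a dict that maps names to int lengths."""
    sizes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, length = line.rstrip("\n").split("\t")[:2]
        sizes[name] = int(length)  # convert explicitly, so arithmetic works later
    return sizes


# Hypothetical file content, for illustration only.
sizes = read_sizes(["chr1\t1000\n", "chrM\t250\n"])
print(sizes["chr1"] + sizes["chrM"])
```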
0.8.3 - 2020-06-03
- Fixed bug introduced by fixing a bug: Provider-specific options for genomepy install on command line work again
- UCSC annotations can now once again be obtained from knownGene.txt
- UCSC gene annotations will now be downloaded in GTF format where possible
- Desired UCSC gene annotation type can now be specified in the genomepy install command using --ucsc-annotation
- Added the NCBI RefSeq gene annotation to the list of potential UCSC gene annotations for download
0.8.2 - 2020-05-25
- Genome.sizes and Genome.gaps are now populated automatically.
- backwards compatibility with old configuration files (with genome_dir instead of genomes_dir)
) - updating the README.txt will only happen if you have write permission
- after gzipping files the original unzipped file is now properly removed
- providers will only download genome summaries when specifically queried
- updated blacklist for hg38/GRCh38 based on work by Anshul Kundaje, see ENCODE README.txt
0.8.1 - 2020-05-11
- Now using the UCSC REST API
- genomepy search now accepts taxonomy IDs
- genomepy search will now return taxonomy IDs and Accession numbers
- The README.txt will now store taxonomy IDs and Accession numbers
- Gene annotations:
- Downloading of annotation file (BED/GTF/GFF3) from URL
- Automatic search for annotation file (GTF/GFF3) in genome directory when downloading from URL
- Option for URL provider to link to annotation file (to process it similarly to other providers)
- Automatic annotation sanitizing (and skip sanitizing flag -s for genomepy install)
- Option to only download annotation with genomepy install -o
- Plugins:
- Blacklists are automatically unzipped.
- Multithreading support for plugins, thanks to @alienzj!
- STAR and HISAT2 will now generate splice-aware indexes if annotation files are available.
- Genome.props has been renamed to Genome.plugin
- sizes no longer a plugin, but always gets executed
- genomepy FUNCTION --help texts expanded
- all genomepy classes exported when imported into Python
- all providers now let you know when they are downloading assembly information.
- more descriptive feedback to installing & many errors
- Sizes plugin
- Old tests
- Removed outdated dependency xmltodict
- genomepy config options made more robust
- README.txt will no longer:
- update 3x for each command
- drop regex info
- have duplicate lines
- Genome class moved to genome.py
- Many functions moved to utils.py
- Many other functions made static methods of a class
- Genome.track2fasta and Genome.get_random_sequence optimized
- All Provider classes now store their genomes as a dict-in-dict, with the assembly name as key.
- Many Provider class functions now standardized. Many functions moved from the daughter classes to the ProviderBase class.
- README.txt file generation and updating standardized
- Unit tests! All functions now have an individual test. Almost all tests use functions already tested prior to them.
- Old tests incorporated in several extra tests (e01, e02, e03).
- Raise statements now use more fitting errors
- All instances of os.remove exchanged for os.unlink
- Almost all warnings fixed
- Extensive, COVID19-enabled, and somewhat pointless alphabetizing, optimizing and/or organizing changes to
- imports everywhere
- .gitignore
- .travis.yml
- release_checklist.md
- cli.py
- strings (many strings with .format() replaced with f-strings)
0.7.2 - 2019-03-31
- Fix minor issue with hg19 wrong blacklist url
- Ensembl downloads over http instead of https (release 99 no longer has https)
0.7.1 - 2019-11-20
- STAR is no longer enabled by default
0.7.0 - 2019-11-18
- Direct downloading from url through url provider.
- Added --force flag. Files will no longer be overwritten by default.
- Provider specific options:
- --ensembl-version: specify release version.
- --ensembl-toplevel: by default, genomepy install will search for primary assemblies. This flag will only download toplevel assemblies.
- Added STAR index plugin
- Providers are now case-insensitive.
- Extended testing.
- Increased minimal Python version to 3.6.
- Removed gaps from plugins, added gaps to core functionality.
- bugfix: NCBI will show all versions of an assembly (will no longer filter on BioSample ID, instead filters on asm_name).
- fix: gaps file will be generated when needed.
0.6.1 - 2019-10-10
- Fixed bug with get_track_type.
0.6.0 - 2019-09-11
- Support for storing bgzip-compressed genomes (#41).
- Removed support for Python 2 (2020 is close!).
- Ensembl annotation for non-vertebrate genomes should work again.
- Fixed bug where a deleted or empty config file would result in an error.
0.5.5 - 2019-03-19
- Plugin for downloading genome blacklists (from Kundaje lab).
- Fix for new Ensembl REST API and FTP layout.
- Genomes from Ensembl with a space in their name can be downloaded.
- Plugin imports use relative parts to prevent conflicts with other imports.
0.5.4 - 2019-03-19
- Downloading annotation from NCBI now implemented.
- Genbank assemblies at NCBI can be searched and downloaded
- Fixed #23.
- Fixed #26.
- Fixed Ensembl downloads (#30)
- Fixed FTP tests for CI
0.5.2 - 2018-09-11
- Fixed genomes_dir argument to genomepy install
- Fixed msgpack dependency
- Fixed issue with config generate where config directory does not exist.
- Added requests dependency
- Removed dependency on xdg, as it didn't support OSX
- Fixed string decoding bug
- Started CHANGELOG.
- Genome listings are cached locally.
- Added -m hard option to install to hard-mask sequences.
- Added -l option to install for a custom name.
- Added -r and --match/--no-match option to select sequences by regex.
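
The last two entries describe two sequence-level operations: selecting sequences by a regex on their names (keeping either the matches or the non-matches) and hard-masking. A standalone sketch of both follows. It assumes that hard-masking means replacing soft-masked (lowercase) bases with N and that the regex is searched anywhere in the name. The helper names and records are hypothetical, and this is not genomepy's implementation.

```python
import re


def select_sequences(records, pattern, match=True):
    """Keep records whose name matches the regex (match=True) or does not (match=False)."""
    regex = re.compile(pattern)
    return {
        name: seq
        for name, seq in records.items()
        if bool(regex.search(name)) == match
    }


def hard_mask(sequence):
    """Replace soft-masked (lowercase) bases with N."""
    return re.sub(r"[a-z]", "N", sequence)


# Hypothetical records (name -> sequence), for illustration only.
records = {
    "chr1": "ACGTacgtACGT",
    "chr2": "ttttACGT",
    "chrUn_random": "ACGTACGT",
}

kept = select_sequences(records, r"^chr\d+$")
print(sorted(kept))
print(sorted(select_sequences(records, r"^chr\d+$", match=False)))
print({name: hard_mask(seq) for name, seq in kept.items()})
```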