-- This project is not actively maintained anymore. --
GenMap computes the uniqueness of k-mers for each position in the genome while allowing for up to e mismatches. More formally, the uniqueness or (k,e)-mappability can be described for every position as the reciprocal value of how often each k-mer occurs approximately in the genome, i.e., with up to e mismatches. Hence, a mappability value of 1 at position i indicates that the k-mer in the sequence at position i occurs only once in the sequence with up to e mismatches. A low mappability value indicates that this k-mer belongs to a repetitive region. GenMap can be applied to single or multiple genomes and helps to find regions that are unique or shared by many or all genomes.
Below you can see the (4,1)-mappability M and frequency F of the nucleotide sequence T = ATCTAGGCTAATCTA. The mappability value M[1] = 0.33 means that the 4-mer starting at position 1, T[1..4] = TCTA, occurs three times in the sequence with up to one mismatch: at positions 1 (TCTA), 6 (GCTA) and 11 (TCTA). Positions are counted from 0.
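The definition can be checked on this small example with a brute-force sketch. It is only an illustration of the (k,e)-mappability definition, not GenMap's algorithm (GenMap works on a prebuilt index); the function names are hypothetical, and the quadratic comparison of all k-mers is only feasible for tiny inputs.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def frequency_and_mappability(text, k, e):
    """Brute-force (k,e)-frequency F and mappability M = 1/F per k-mer start."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    # F[i] counts every k-mer (including the one at i) within e mismatches.
    freq = [sum(hamming(q, other) <= e for other in kmers) for q in kmers]
    mapp = [1.0 / f for f in freq]
    return freq, mapp


T = "ATCTAGGCTAATCTA"
F, M = frequency_and_mappability(T, k=4, e=1)
print(F[1], round(M[1], 2))  # prints: 3 0.33
```

The k-mer at position 1 is matched by itself and by the k-mers at positions 6 and 11, so F[1] = 3 and M[1] = 1/3.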
The mappability can be exported in various formats that allow post-processing or display in genome browsers.
A small example of how to run GenMap is listed below; further details are on the GitHub Wiki pages.
For questions or feature requests feel free to open an issue on GitHub or send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de.
Christopher Pockrandt, Mai Alzamel, Costas S. Iliopoulos, Knut Reinert. GenMap: Ultra-fast Computation of Genome Mappability. Bioinformatics, 2020.
$ conda install -c bioconda genmap
Your CPU must support the POPCNT instruction. If you have a modern CPU, you can go with the optimized 64 bit version that additionally uses SSE4. This improves the running time by 10 %. To verify whether your CPU supports these instruction sets, you can check the output of
$ cat /proc/cpuinfo | grep -E "mmx|sse|popcnt"
on Linux or
$ sysctl -a | grep -i -E "mmx|sse|popcnt"
on Mac.
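The same check can be scripted. The sketch below parses text in the format of /proc/cpuinfo on x86 Linux and reports which required flags are missing. The function name is hypothetical, and treating "SSE4" as the Linux flag names sse4_1 and sse4_2 is an assumption, since the text above does not say which SSE4 level the optimized build needs.

```python
def missing_cpu_flags(cpuinfo_text, required=("popcnt",)):
    """Return the required flags absent from /proc/cpuinfo-style text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":  # x86 Linux lists CPU features here
            present.update(value.split())
    return [flag for flag in required if flag not in present]


# Hypothetical excerpt; on a real Linux machine, read /proc/cpuinfo instead.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 popcnt sse4_1\n"
print(missing_cpu_flags(sample, ("popcnt", "sse4_1", "sse4_2")))
```

An empty result means every required flag was found.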
Platform | Download | Version | Additional requirements |
Linux 64 bit | | 1.3.0 (2020-06-17) | - |
Linux 64 bit optimized | | 1.3.0 (2020-06-17) | requires SSE4 |
Mac 64 bit | | 1.3.0 (2020-06-17) | - |
Mac 64 bit optimized | | 1.3.0 (2020-06-17) | requires SSE4 |
If you want to build it from source, we recommend cloning the git repository as shown below. The tarballs on GitHub do not contain git submodules (i.e., SeqAn). Please note that building from source can easily take 10 minutes or longer, depending on your machine and compiler.
$ git clone --recursive https://github.com/cpockrandt/genmap.git
$ mkdir genmap-build && cd genmap-build
$ cmake ../genmap -DCMAKE_BUILD_TYPE=Release
$ make genmap
You can install genmap as follows:
$ sudo make install
$ genmap
or run the binary directly:
$ ./genmap
If you are using a very old version of Git (< 1.6.5), the flag --recursive does not exist. In this case you need to clone the submodule separately before you can run cmake:
$ git clone https://github.com/cpockrandt/genmap.git
$ cd genmap
$ git submodule update --init --recursive
Requirements
- Operating System: GNU/Linux, Mac
- Architecture: Intel/AMD platforms that support POPCNT
- Compiler: GCC ≥ 4.9, LLVM/Clang ≥ 3.8
- Build system: CMake ≥ 3.0
- Language support: C++14
First, you have to build an index of the fasta file(s) whose mappability you want to compute. This step only has to be performed once. You might want to check out the pre-built indices available for download.
$ ./genmap index -F /path/to/fasta.fasta -I /path/to/index/folder
A new folder /path/to/index/folder will be created to store the index and all associated files.
There are two algorithms that can be chosen for index construction. One uses RAM (divsufsort), the other uses secondary memory/disk space (skew). Depending on your quota and main memory limitations, you can choose the appropriate algorithm with -A divsufsort or -A skew.
It is recommended to use divsufsort (the default setting). It needs about 6n space in main memory (or 10n for fasta files >2GB), where n is the number of bases in your fasta file(s). It might be more or less depending on the number and length of the individual sequences. If you are running out of memory, you can try to reduce the memory consumption a bit by increasing -S, e.g., use -S 20 (up to 64), although this will slow down the algorithm that computes the mappability.
Skew needs more space on disk, at least 25n.
You can change the location of the temp directory via the TMPDIR environment variable (e.g., to choose a directory with more quota):
$ export TMPDIR=/somewhere/else/with/more/space
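To get a feeling for these numbers before starting a long indexing run, the factors quoted above can be applied to the number of bases in a fasta file. This is a rough sketch with hypothetical helper names. It returns plain multiples of n; reading them as bytes is an assumption, since the text does not state a unit, and whether the 10n factor applies (fasta files >2GB) is left to the caller.

```python
def count_bases(fasta_lines):
    """Count sequence characters, skipping '>' header lines and blank lines."""
    return sum(len(line.strip()) for line in fasta_lines
               if line.strip() and not line.startswith(">"))


def index_space_estimates(n, large_input=False):
    """Apply the documented factors: 6n (or 10n) in RAM, at least 25n on disk."""
    return {
        "divsufsort_ram": (10 if large_input else 6) * n,
        "skew_disk_min": 25 * n,
    }


# Hypothetical toy input; for a real genome, pass an open file handle instead.
toy_fasta = [">chr1\n", "ACGT\n", "AC\n"]
n = count_bases(toy_fasta)
print(n, index_space_estimates(n))
```

As noted above, actual consumption may be higher or lower depending on the number and length of the individual sequences.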
To compute the (30,2)-mappability of the previously indexed genome, simply run:
$ ./genmap map -K 30 -E 2 -I /path/to/index/folder -O /path/to/output/folder -t -w -bg
This will create a text, wig and bedGraph file in /path/to/output/folder storing the computed mappability in different formats. You can omit formats that are not required by removing the corresponding flags -t, -w or -bg.
To output the frequency instead of the mappability, add the flag -fl to the previous command.
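As an example of post-processing, the sketch below extracts uniquely mappable intervals from bedGraph records. It assumes the standard four-column bedGraph layout (chromosome, start, end, value); the function name and the sample records are hypothetical and not taken from real GenMap output.

```python
def unique_intervals(bedgraph_lines, threshold=1.0):
    """Yield (chrom, start, end) for records whose value is >= threshold."""
    for line in bedgraph_lines:
        fields = line.split()
        # Skip comments, track/browser lines and anything malformed.
        if (len(fields) != 4 or line.startswith("#")
                or fields[0] in ("track", "browser")):
            continue
        chrom, start, end, value = fields
        if float(value) >= threshold:
            yield chrom, int(start), int(end)


# Hypothetical records for illustration only.
sample_records = [
    "track type=bedGraph",
    "chr1\t0\t5\t1",
    "chr1\t5\t9\t0.333333",
    "chr1\t9\t12\t1",
]
print(list(unique_intervals(sample_records)))
# prints: [('chr1', 0, 5), ('chr1', 9, 12)]
```

Lowering the threshold (e.g., to 0.5) would also keep k-mers that occur up to twice.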
A detailed list of arguments and explanations can be retrieved with --help:
$ ./genmap --help
$ ./genmap index --help
$ ./genmap map --help
More detailed examples can be found in the Wiki.
Building an index on a large genome takes some time and requires a lot of space. Hence, we provide indexed genomes for download.
If you need other genomes indexed and do not have the computational resources, please send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de.
The genomes were built with a higher sampling value (-S 20) to reduce the index size.
To increase speed when computing the mappability and outputting csv files, you can build your own index with a lower sampling value.
The genomes do not contain alt scaffolds (i.e., only chromosomes and unplaced/unlocalized fragments).
Genome | Index size (compressed) | Download |
Human GRCh38 [1] | 5.4 GB | GRCh38 index |
Human hs37-1kg [2] | 5.4 GB | hs37-1kg index |
Mouse GRCm38 | 4.9 GB | GRCm38 index |
D. melanogaster dm6 | 0.2 GB | dm6 index |
C. elegans ce11 | 0.1 GB | ce11 index |
Wheat T. aestivum ta45 [3] | 21.9 GB | ta45 index |
[1] | ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz |
[2] | ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz |
[3] | ftp://ftp.ensemblgenomes.org/pub/plants/release-45/fasta/triticum_aestivum/dna/Triticum_aestivum.IWGSC.dna.toplevel.fa.gz |