Home
JAligner is an open-source Java implementation of the Smith-Waterman algorithm with Gotoh’s improvement for biological local pairwise sequence alignment using the affine gap penalty model.
- JAligner can be used through a graphical user interface (GUI), a simple command-line syntax, and a reusable application programming interface (API).
- The space complexity of the dynamic programming over the main similarity score matrix and the two auxiliary gap matrices is reduced from O(m × n) to O(n), where m and n are the lengths of the vertical and horizontal sequences respectively, by using single-dimensional arrays of size n instead of the original two-dimensional arrays of size m × n.
- The two-dimensional array of size m × n, for holding the traceback directions (diagonal, left, up and stop), is mapped into a single-dimensional array of size m×n. This approach speeds up the process of memory allocation because the Java Virtual Machine (JVM) attempts to allocate a single-dimensional array of m × n “bytes” (primitive data type), instead of attempting to allocate an array of m “objects”, each of which is an “array” of n bytes.
- In addition to 70 standard scoring (or substitution) matrices, JAligner accepts user-defined scoring matrices.
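As a rough illustration of the two memory techniques described above, the following sketch computes a local alignment score under the affine gap model using only one-dimensional score arrays, and stores the traceback directions in a single flat byte array indexed by i × n + j. It is not JAligner's source code: the class name, the simple match/mismatch scoring (used here in place of a scoring matrix), and the direction constants are assumptions made for the example.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and scoring are hypothetical, not JAligner's code. */
final class LinearSpaceLocalScoreSketch {
    static final byte STOP = 0, DIAGONAL = 1, LEFT = 2, UP = 3;

    /** Best local score plus the flattened traceback directions. */
    static final class Result {
        final float score;
        final byte[] directions; // cell (i, j) lives at index i * cols + j
        final int cols;

        Result(float score, byte[] directions, int cols) {
            this.score = score;
            this.directions = directions;
            this.cols = cols;
        }
    }

    static Result align(String a, String b, float match, float mismatch,
                        float open, float extend) {
        int m = a.length() + 1; // rows: vertical sequence plus the border row
        int n = b.length() + 1; // columns: horizontal sequence plus the border column

        float[] v = new float[n]; // similarity scores; holds the previous row until overwritten
        float[] g = new float[n]; // best score ending in a vertical gap, per column
        Arrays.fill(g, Float.NEGATIVE_INFINITY);

        // One flat allocation of primitive bytes; zero-filled, so every cell starts as STOP.
        byte[] directions = new byte[m * n];

        float best = 0f;
        for (int i = 1; i < m; i++) {
            float h = Float.NEGATIVE_INFINITY; // best score ending in a horizontal gap
            float diagonal = v[0];             // score of cell (i - 1, j - 1)
            for (int j = 1; j < n; j++) {
                float similarity = diagonal
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                g[j] = Math.max(g[j] - extend, v[j] - open);     // v[j] is still row i - 1
                h = Math.max(h - extend, v[j - 1] - open);       // v[j - 1] is already row i
                diagonal = v[j];                                 // keep before overwriting

                float cell = Math.max(0f, Math.max(similarity, Math.max(g[j], h)));
                v[j] = cell;

                byte direction;
                if (cell == 0f) direction = STOP;
                else if (cell == similarity) direction = DIAGONAL;
                else if (cell == h) direction = LEFT;
                else direction = UP;
                directions[i * n + j] = direction;

                best = Math.max(best, cell);
            }
        }
        return new Result(best, directions, n);
    }
}
```

For example, `LinearSpaceLocalScoreSketch.align("ACGT", "ACGT", 1f, -1f, 10f, 0.5f).score` evaluates to 4.0 under these toy scores. Only the traceback array grows with m × n; the score arrays stay at size n.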
java -jar jaligner.jar <s1> <s2> <matrix> <open> <extend>
where:
- s1: path to a file containing input sequence #1.
- s2: path to a file containing input sequence #2.
- matrix: name of a scoring matrix, or path to a file containing a user-defined scoring matrix.
- open: open gap penalty.
- extend: extend gap penalty.
java -jar jaligner.jar s1.fa s2.fa BLOSUM62 10.0 0.5
To load a user-defined scoring matrix from the file system, the path to the matrix file has to include at least one file separator (a file separator signals JAligner to load the scoring matrix from the file system instead of looking it up in jaligner.jar).
java -jar jaligner.jar s1.fa s2.fa ./matrix.txt 10.0 0.5
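The rule above explains why the example passes ./matrix.txt rather than plain matrix.txt. A minimal sketch of such a check follows; the class and method names are hypothetical, and the actual test inside JAligner may be implemented differently.

```java
import java.io.File;

/** Illustrative sketch only; not JAligner's actual loader logic. */
final class MatrixSourceSketch {
    /**
     * Returns true when the argument contains the platform's file separator,
     * i.e. when it would be treated as a file-system path rather than as the
     * name of a bundled scoring matrix.
     */
    static boolean isFileSystemPath(String matrix) {
        return matrix.indexOf(File.separatorChar) >= 0;
    }
}
```

Under this sketch, a name such as BLOSUM62 or matrix.txt contains no separator and would be looked up as a bundled matrix, while "." followed by the separator and matrix.txt would be read from disk.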
A user-defined scoring matrix file is expected in the following format:
- optional comment lines (a comment line starts with a number sign, #),
- a header line with the letters in the alphabet of the two sequences, and
- a line for each letter in the alphabet, where each line starts with that letter followed by the substitution scores for the corresponding letters in the header line.
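To make the format concrete, here is a minimal sketch of a reader for such a file. It assumes that fields are separated by whitespace and that all letters are ASCII; the class name and the 128 × 128 lookup table are choices made for this example, not a description of JAligner's own loader.

```java
/** Illustrative sketch only; assumes whitespace-separated fields and ASCII letters. */
final class ScoringMatrixParserSketch {
    /** Parses the matrix text into a table indexed as scores[rowLetter][columnLetter]. */
    static float[][] parse(String text) {
        float[][] scores = new float[128][128];
        char[] header = null;

        for (String rawLine : text.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            String[] tokens = line.split("\\s+");

            if (header == null) {
                // First non-comment line: the letters of the alphabet.
                header = new char[tokens.length];
                for (int k = 0; k < tokens.length; k++) {
                    header[k] = tokens[k].charAt(0);
                }
                continue;
            }

            // Every following line: a letter, then one score per header letter.
            if (tokens.length != header.length + 1) {
                throw new IllegalArgumentException("Unexpected number of scores: " + line);
            }
            char rowLetter = tokens[0].charAt(0);
            for (int k = 0; k < header.length; k++) {
                scores[rowLetter][header[k]] = Float.parseFloat(tokens[k + 1]);
            }
        }
        if (header == null) {
            throw new IllegalArgumentException("No header line found");
        }
        return scores;
    }
}
```

For a toy two-letter matrix with a comment line, the header "A C", and the rows "A 2 -1" and "C -1 2", the lookup `scores['A']['C']` holds -1.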
The command line to start JAligner as a desktop GUI application is java -jar jaligner.jar. There are also downloadable installers for Linux, UNIX, Mac OS X and Windows.
Class SmithWatermanGotoh has the public static method align, which can be called programmatically to align two sequences.
By default, the JVM limits the size of its memory allocation pool (older JVMs start with an initial size of 2 MB and a maximum of 64 MB; more recent ones derive the default from the available physical memory). Aligning large sequences raises an out-of-memory error when the memory requirement exceeds the available space; in such cases, start the JVM with a suitable heap size using the -Xms (initial size) and -Xmx (maximum size) options.
java -Xms128m -Xmx512m -jar jaligner.jar
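To judge whether a larger heap is needed, a back-of-the-envelope estimate can help. The sketch below counts only the traceback array described earlier (one byte per cell, m × n cells) and ignores all other allocations, so it gives a lower bound rather than an exact figure. The class and method names are hypothetical.

```java
/** Illustrative sketch only; a lower-bound estimate, not a measurement. */
final class HeapEstimateSketch {
    private static final long MIB = 1024L * 1024L;

    /** Bytes needed for one traceback direction per cell of an m x n matrix. */
    static long tracebackBytes(long m, long n) {
        return m * n;
    }

    /** Smallest whole number of megabytes that can hold the traceback array alone. */
    static long minimumXmxMegabytes(long m, long n) {
        return (tracebackBytes(m, n) + MIB - 1) / MIB;
    }

    public static void main(String[] args) {
        long m = 30_000, n = 30_000; // example sequence lengths
        System.out.println("Traceback needs at least "
                + minimumXmxMegabytes(m, n) + " MB; current maximum heap is "
                + Runtime.getRuntime().maxMemory() / MIB + " MB");
    }
}
```

For two sequences of 30,000 residues each, the traceback array alone takes 900,000,000 bytes, so -Xmx512m would not be enough. Note also that a single Java array cannot hold more than about 2^31 elements, which caps m × n regardless of heap size.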
The source code of JAligner is licensed under The GNU General Public License (GPL).
If you are using JAligner in a published work or product, please cite: Ahmed Moustafa, JAligner: open source Java implementation of Smith-Waterman (the date accessed)
For a list of publications citing JAligner, see its Google Scholar citations.
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7. PMID: 7265238.
- Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982 Dec 15;162(3):705-8. PMID: 7166760.
I deeply appreciate everyone who has contributed questions, comments or suggestions regarding JAligner; every piece of feedback has been helpful, and I have learned from it. I would like to express my special thanks to:
- ej-technologies: providing a free license for install4j (May 2005),
- Bram Minnaert: (1) detecting a bug in the initialization of the auxiliary matrices (October 2004), (2) suggesting a fix to the traceback logic, and (3) providing testing modules for testing the produced alignments against the alignment scores (March 2005),
- Hector Gonzalez: detecting a bug in the initialization of the traceback matrix (March 2004),
- Andreas Doms: detecting a bug in the traceback stopping condition and suggesting a fix that improved the performance as well (February 2004),
- Ryan Golhar: recommending changing the traceback from recursion to iteration to avoid a stack overflow problem (August 2003), and
- Tim Carver: feedback on the GUI layout and alignment format (July 2003).