NOTE : unless you have specific reason to use pyfasta, I suggest to use pyfaidx
Author: | Brent Pedersen (brentp) |
---|---|
Email: | [email protected] |
License: | MIT |
Contents
Requires Python >= 2.6. Stores a flattened version of the fasta file without spaces or headers and uses either a mmap of numpy binary format or fseek/fread so the sequence data is never read into memory. Saves a pickle (.gdx) of the start, stop (for fseek/mmap) locations of each header in the fasta file for internal use.
>>> from pyfasta import Fasta >>> f = Fasta('tests/data/three_chrs.fasta') >>> sorted(f.keys()) ['chr1', 'chr2', 'chr3'] >>> f['chr1'] NpyFastaRecord(0..80)
# get full the sequence: >>> a = str(f['chr1']) >>> b = f['chr1'][:] >>> a == b True >>> f['chr1'][:10] 'ACTGACTGAC' # get the 1st basepair in every codon (it's python yo) >>> f['chr1'][::3] 'AGTCAGTCAGTCAGTCAGTCAGTCAGT' # can query by a 'feature' dictionary (note this is one based coordinates) >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9}) 'CTGACTGA' # same as: >>> f['chr1'][1:9] 'CTGACTGA' # use python, zero based coords >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9}, one_based=False) 'TGACTGA' # with reverse complement (automatic for - strand) >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'}) 'TCAGTCAG'
Sometimes your fasta will have a long header like: "AT1G51370.2 | Symbols: | F-box family protein | chr1:19045615-19046748 FORWARD" when you only want to key off: "AT1G51370.2". In this case, specify the key_fn argument to the constructor:
>>> fkey = Fasta('tests/data/key.fasta', key_fn=lambda key: key.split()[0]) >>> sorted(fkey.keys()) ['a', 'b', 'c']
The default is to use a memmaped numpy array as the backend. In which case it's possible to get back an array directly...
>>> f['chr1'].as_string = False >>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1') >>> import numpy as np >>> a = np.array(f['chr2']) >>> a.shape[0] == len(f['chr2']) True >>> a[10:14] # doctest: +NORMALIZE_WHITESPACE array(['A', 'A', 'A', 'A'], dtype='|S1')
mask a sub-sequence
>>> a[11:13] = np.array('N', dtype='S1') >>> a[10:14].tostring() 'ANNA'
It's also possible to specify another record class as the underlying work-horse for slicing and reading. Currently, there's just the default:
- NpyFastaRecord which uses numpy memmap
- FastaRecord, which uses using fseek/fread
- MemoryRecord which reads everything into memory and must reparse the original fasta every time.
- TCRecord which is identical to NpyFastaRecord except that it saves the index in a TokyoCabinet hash database, for cases when there are enough records that loading the entire index from a pickle into memory is unwise. (NOTE: that the sequence is not loaded into memory in either case).
It's possible to specify the class used with the record_class kwarg to the Fasta constructor:
>>> from pyfasta import FastaRecord # default is NpyFastaRecord >>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord) >>> f['chr1'] FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)
other than the repr, it should behave exactly like the Npy record class backend
it's possible to create your own using a sub-class of FastaRecord. see the source in pyfasta/records.py for details.
In order to efficiently access the sequence content, pyfasta saves a separate, flattened file with all newlines and headers removed from the sequence. In the case of large fasta files, one may not wish to save 2 copies of a 5GG+ file. In that case, it's possible to flatten the file "inplace", keeping all the headers, and retaining the validity of the fasta file -- with the only change being that the new-lines are removed from each sequence. This can be specified via flatten_inplace = True
>>> import os >>> os.unlink('tests/data/three_chrs.fasta.gdx') # cleanup non-inplace idx >>> f = Fasta('tests/data/three_chrs.fasta', flatten_inplace=True) >>> f['chr1'] # note the difference in the output from above. NpyFastaRecord(6..86) # sequence from is same as when requested from non-flat file above. >>> f['chr1'][1:9] 'CTGACTGA' # the flattened file is kept as a place holder without the sequence data. >>> open('tests/data/three_chrs.fasta.flat').read() '@flattened@'
there's also a command line interface to manipulate / view fasta files. the pyfasta executable is installed via setuptools, running it will show help text.
split a fasta file into 6 new files of relatively even size:
$ pyfasta split -n 6 original.fasta
split the fasta file into one new file per header with "%(seqid)s" being filled into each filename.:
$ pyfasta split --header "%(seqid)s.fasta" original.fasta
create 1 new fasta file with the sequence split into 10K-mers:
$ pyfasta split -n 1 -k 10000 original.fasta
2 new fasta files with the sequence split into 10K-mers with 2K overlap:
$ pyfasta split -n 2 -k 10000 -o 2000 original.fasta
show some info about the file (and show gc content):
$ pyfasta info --gc test/data/three_chrs.fasta
extract sequence from the file. use the header flag to make a new fasta file. the args are a list of sequences to extract.
$ pyfasta extract --header --fasta test/data/three_chrs.fasta seqa seqb seqc
extract sequence from a file using a file containing the headers not wanted in the new file:
$ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt
extract sequence from a fasta file with complex keys where we only want to lookup based on the part before the space.
$ pyfasta extract --header --fasta input.with.keys.fasta --space --file seqids.txt
flatten a file inplace, for faster later use by pyfasta, and without creating another copy. (Flattening)
$ pyfasta flatten input.fasta
(though for real use these will remain for faster access)
>>> os.unlink('tests/data/three_chrs.fasta.gdx') >>> os.unlink('tests/data/three_chrs.fasta.flat')
there is currently > 99% test coverage for the 2 modules and all included record classes. to run the tests:
$ python setup.py nosetests