Skip to content
Martin Asser Hansen edited this page Oct 2, 2015 · 7 revisions

Biopieces Introduction

Introduction

In the most simple form Biopieces can be pieced together on the command line like this (using the pipe character '|'):

read_data | calculate_something | write_result

However, a more comprehensive analysis could be composed:

read_data | select_entries | convert_entries | search_database | evaluate_results | plot_diagram | plot_another_diagram | load_to_database

In the documentation a long pipe like the above will typically be written on multiple lines to give a better over view like this:

read_data |
select_entries |
convert_entries |
search_database |
evaluate_results |
plot_diagram |
plot_another_diagram |
load_to_database

You can either type these on the command line in a single line, or copy/paste multiple lines.

Each Biopiece perform only a single task and may react only to data records (or Biopiece records) containing certain information. E.g. Biopieces that analyze sequence data only react to Biopiece records containing sequences, while all other records simply are passed through, so that they may be analyzed further downstream. Sometimes Biopiece records are not passed through, e.g. if sequences are assembled into longer sequences - then the input sequences are replaced by the output sequences. Biopieces do this by following the principle of least surprise. The data stream is terminated at the end of the last analysis.

The data stream that is piped through the Biopieces consists of records of key/value pairs in the same way a hash does in order to keep as simple a structure as possible. An example record can be seen below:

REC_TYPE: PATSCAN
MATCH: AGATCAAGTG
S_BEG: 7
S_END: 16
ALIGN_LEN: 9
S_ID: piR-t6
STRAND: +
PATTERN: AGATCAAGTG
---

The --- denotes the delimiter of the records, and each key is a word followed by a ':' and a white-space and then the value. By convention the Biopieces only uses upper case keys. Since the records basically are hash structures this mean that the order of the keys in the stream is unordered, and in the above example it is pure coincidence that HIT_BEG is displayed before HIT_END.

All of the Biopieces are able to read and write a data stream to and from file as long as the records are in the Biopieces format. This means that if you are undertaking a lengthy analysis where one of the steps is time consuming, you may save the stream after this step, and subsequently start one or more analysis from that last step ).

If you are running a lengthy analysis, it is highly recommended that you create a small test sample of the data and run that through the pipeline --- and once you are satisfied with the result proceed with the full data set.

All of the Biopieces can be supplied with long arguments prefixed with switches or single character arguments prefixed with - switches that can be grouped together (e.g. -xok). In this introduction only the long switches are used to emphasize what these switches do.

Getting started

First you need to be acquainted with some basic Unix commands, here is a tutorial (Tutorial 1-3 should do):

http://www.ee.surrey.ac.uk/Teaching/Unix/

And when you are ready to move on - move on:

In order to find out how a specific Biopiece works, you just type the program name without any arguments and press return and the usage of the Biopiece will be displayed. E.g. read_fasta <return>:

Biopiece: read_fasta

Usage

   read_fasta [options] -i <FASTA file(s)>

Options

   [-?           | --help]               #  Print full usage description.
   [-i <file(s)> | --data_in=<file(s)>]  #  Comma separated list of files or glob expression to read.
   [-n <int>     | --num=<int>]          #  Limit number of records to read.
   [-I <file>    | --stream_in=<file>]   #  Read input stream from file  -  Default=STDIN
   [-O <file>    | --stream_out=<file>]  #  Write output stream to file  -  Default=STDOUT

Help

read_fasta is part of the Biopieces framework.

http://www.biopieces.org

Now, this is only a short summary of the usage, to print the full usage use the --help switch to get this:

Biopiece: read_fasta

Description

read_fasta read in sequence entries from FASTA files. Each sequence
entry consists of a sequence name prefixed by a '>' followed by the sequence
name on a line of its own, followed by one or my lines of sequence until the
next entry or the end of the file. The resulting biopiece record consists of
the following record type:

   SEQ_NAME: test
   SEQ_LEN: 10
   SEQ: ATCGATCGAC
   ---

For more about the FASTA format:

http://en.wikipedia.org/wiki/Fasta_format

Usage

   read_fasta [options] -i <FASTA file(s)>

Options

   [-?           | --help]               #  Print full usage description.
   [-i <file(s)> | --data_in=<file(s)>]  #  Comma separated list of files or glob expression to read.
   [-n <int>     | --num=<int>]          #  Limit number of records to read.
   [-I <file>    | --stream_in=<file>]   #  Read input stream from file  -  Default=STDIN
   [-O <file>    | --stream_out=<file>]  #  Write output stream to file  -  Default=STDOUT

Examples

To read all FASTA entries from a file:

   read_fasta -i test.fna

To read in only 10 records from a FASTA file:

   read_fasta -n 10 -i test.fna

To read all FASTA entries from multiple files:

   read_fasta -i test1.fna,test2.fna

To read FASTA entries from multiple files using a glob expression:

   read_fasta -i '*.fna'


See also

read_align

write_fasta

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

read_fasta is part of the Biopieces framework.

http://www.biopieces.org

Or you can go the the appropriate wiki page read_fasta at

http://www.biopieces.org/

Which currently points to Github:

http://github.com/maasha/biopieces/wiki/read_fasta

The Data Stream

Since Biopieces work by passing data records from Biopiece to Biopiece you always need more than one Biopiece (execpt special cases). Typically, one Biopiece to parse data from some format, and one Biopiece to perform some analysis like counting or plotting - or writing the data in a new format. Data in different formats can be read with the appropriate Biopiece for that format. The Biopieces are typically named read_<data type> such as read_fasta, read_tab, etc., and all behave in a similar manner. Data can be read by supplying the --data_in switch and a file name to the file containing the data:

<biopiece> --data_in=<file>

Consider the following three FASTA entries in the file test.fna:

>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG

In order to read these we use read_fasta like this:

read_fasta --data_in=test.fna                                    

SEQ_NAME: test1
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_LEN: 30
---
SEQ_NAME: test2
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_LEN: 30
---
SEQ_NAME: test3
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_LEN: 30
---

and et viola, the sequences are emitted in the data stream as Biopiece records - and displayed in the terminal window. Now we perform the most basic analysis - we count the records using count_records:

read_fasta --data_in=test.fna | count_records

SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_LEN: 30
SEQ_NAME: test1
---
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_LEN: 30
SEQ_NAME: test2
---
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_LEN: 30
SEQ_NAME: test3
---
RECORDS_COUNT: 3
---

Now the sequence records are supplemented with an extra record holding the RECORD_COUNT. But that is not very handy, if you have millions of sequences, that these are all displayed. This is because the data stream never stops unless the user want to save the stream or by applying the --no_stream switch that will terminate the stream. The --no_stream switch only works with those Biopieces where it makes sense that the user might want to terminate the data stream, i.e. after an analysis step where the user wants to output the result, but not the data stream. Thus if we use the --no_stream with count_records we get:

read_fasta --data_in=test.fna | count_records --no_stream

RECORDS_COUNT: 3
---

Which has the desired effect. Notice that the output from count_records is a Biopiece record, but it was not part of the data stream.

If we want to save the sequence data in a new FASTA file we have to use write_fasta:

read_fasta --data_in=test.fna | count_records --no_stream | write_fasta

RECORDS_COUNT: 3
---

But this is wrong? - there are no FASTA entries. The output from count_records is a Biopiece record with the RECORD_COUNT and that record is passed to write_fasta don't detect any sequence type records . Of cause we need to remove the --no_stream switch to count_records:

read_fasta --data_in=test.fna | count_records | write_fasta
RECORDS_COUNT: 3
---
>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_NAME: test1
SEQ_LEN: 30
---
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_NAME: test2
SEQ_LEN: 30
---
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_NAME: test3
SEQ_LEN: 30
---

But this is also wrong? We now have a mixture of FASTA type data and Biopiece records including a RECORDS_COUNT record. This is of cause because we forgot to terminate the stream with --no_stream:

read_fasta --data_in=test.fna | count_records | write_fasta --no_stream

>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG

Now the sequences are output nicely in FASTA format, but where did the RECORD_COUNT go? That was passed to write_fasta, which still does not react to records without SEQ_NAME and SEQ keys - thus this record was simply dropped. The trick is to save the output of count_records to a file:

read_fasta --data_in=test.fna | count_records --data_out=count.txt                          

SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_LEN: 30
SEQ_NAME: test1
---
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_LEN: 30
SEQ_NAME: test2
---
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_LEN: 30
SEQ_NAME: test3
---

Now the COUNT_RECORD entry is saved to the file count.txt and the sequence type Biopiece records are passed through. We can verify the content of count.txt using the UNIX command cat:

cat count.txt

RECORDS_COUNT: 3
---

Of cause we could still use --no_stream with count_records while at the same time saving the output to file:

read_fasta --data_in=test.fna | count_records --data_out=count.txt --no_stream

Thus the RECORD_COUNT is written to the count.txt file and no Biopiece records are emitted. Now let us save the record count and the FASTA entries:

read_fasta --data_in=test.fna | count_records --data_out=count.txt | write_fasta --data_out=test2.fna

SEQ: AGATTGAGATTATACTTGCAGCACCCGTCC
SEQ_NAME: test1
SEQ_LEN: 30
---
SEQ: ACGTATTCGCGGGACAACACTTTCTGAACT
SEQ_NAME: test2
SEQ_LEN: 30
---
SEQ: ATTGTAACGAACACCAGATCCCTGCTTGTG
SEQ_NAME: test3
SEQ_LEN: 30
---

Now the COUNT_RECORD is saved to count.txt like before, and the FASTA entries are saved to test2.fna which we can verify with cat:

cat test2.fna

>test1
AGATTGAGATTATACTTGCAGCACCCGTCC
>test2
ACGTATTCGCGGGACAACACTTTCTGAACT
>test3
ATTGTAACGAACACCAGATCCCTGCTTGTG

The final touch is to use --no_stream with write_fasta to terminate the stream:

read_fasta --data_in=test.fna | count_records --data_out=count.txt | write_fasta --data_out=test2.fna --no_stream

Which is identical to the below short-hand command using the short switches:

read_fasta -i test.fna | count_records -o count.txt | write_fasta -xo test2.fna

This is the end of the introduction. Now it is a good time to head over to the HowTo for inspiration.

Clone this wiki locally