This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Adding documentation
Adding in Doxygen capabilities for generating UML.

Adding UML diagram.

Specifying how the UML diagram can be generated.

Automating Avrodoc build with a script.

Adding proper Beacon stuff to the UML

Updating UML to drop AlleleResource

Adding a Graph Mode FAQ

It would be good to have the answers to people's questions about graph mode
all in one place.

Moving and renaming documentation

All the extra Markdowns should go in doc/, and should not have spaces
in the filenames.

Adding an SVG of the UML to the repo.

Make the FAQ make sense with the side graph changes.
adamnovak committed Apr 3, 2015
1 parent 1cbf5ba commit b568aa1
Showing 7 changed files with 25,615 additions and 0 deletions.
1,781 changes: 1,781 additions & 0 deletions contrib/Doxyfile

Large diffs are not rendered by default.

124 changes: 124 additions & 0 deletions contrib/avdlDoxyFilter.py
@@ -0,0 +1,124 @@
#!/usr/bin/env python2.7
"""
avdlDoxyFilter.py: hack Avro IDL files into vaguely C++-like files that Doxygen
can read.
Re-uses sample code and documentation from
<http://users.soe.ucsc.edu/~karplus/bme205/f12/Scaffold.html>
"""

import argparse, sys, os, itertools, re

def parse_args(args):
    """
    Takes in the command-line arguments list (args), and returns a nice argparse
    result with fields for all the options.
    Borrows heavily from the argparse documentation examples:
    <http://docs.python.org/library/argparse.html>
    """

    # The command line arguments start with the program name, which we don't
    # want to treat as an argument for argparse. So we remove it.
    args = args[1:]

    # Construct the parser (which is stored in parser)
    # Module docstring lives in __doc__
    # See http://python-forum.com/pythonforum/viewtopic.php?f=3&t=36847
    # And a formatter class so our examples in the docstring look good. Isn't it
    # convenient how we already wrapped it to 80 characters?
    # See http://docs.python.org/library/argparse.html#formatter-class
    parser = argparse.ArgumentParser(description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)

    # Now add all the options to it
    parser.add_argument("avdl", type=argparse.FileType('r'),
        help="the AVDL file to read")

    return parser.parse_args(args)


def main(args):
    """
    Parses command line arguments, and does the work of the program.
    "args" specifies the program arguments, with args[0] being the executable
    name. The return value should be used as the program's exit code.
    """

    options = parse_args(args) # This holds the nicely-parsed options object

    # Are we in a comment?
    in_comment = False

    # What level of braces are we in?
    brace_level = 0

    for line in options.avdl:
        # For every line of Avro

        # See if it's a comment start or end.
        comment_starter = line.rfind("/*")
        comment_ender = line.rfind("*/")

        if(comment_starter != -1 and (comment_ender == -1 or
            comment_ender < comment_starter)):
            # We have entered a multiline comment

            in_comment = True
        elif comment_ender != -1:
            # We have ended a multiline comment and not started another one.
            in_comment = False

        if in_comment:
            # Just pass comments as-is
            print(line.rstrip())
            continue

        # How many unbalanced braces do we have outside comments?
        brace_change = line.count("{") - line.count("}")

        if line.lstrip().startswith("protocol"):
            # It's a protocol, so make it a Module and an Interface.

            # Grab the protocol name
            name = re.search(r'protocol\s+(\S+)', line).group(1)

            # Make the open lines
            print("namespace {} {{".format(name))
            #print("interface {} {{".format(name))

        elif line.lstrip().startswith("record"):
            # It's a record, so make it a Struct.

            # Grab the record name
            name = re.search(r'record\s+(\S+)', line).group(1)

            print("struct {} {{".format(name))

        elif line.lstrip().startswith("union"):
            # We need to fix up the union with semicolons.

            # Parse out the union
            match = re.search(r"union\s*{(.*)}(.*)", line)

            # What got unioned?
            unioned = match.group(1)

            # What's the rest of the line?
            rest = match.group(2)

            # Make the union a template as far as Doxygen knows.
            print("union<{}>{}".format(unioned, rest))


        elif line.rstrip().endswith("}"):
            # The line is closing something, so it needs a semicolon.
            print("{};".format(line.rstrip()))
        else:
            # Pass other lines
            print(line.rstrip())

        # Change the brace level.
        brace_level += brace_change

if __name__ == "__main__":
    sys.exit(main(sys.argv))
47 changes: 47 additions & 0 deletions contrib/make_avrodoc.sh
@@ -0,0 +1,47 @@
#!/usr/bin/env bash

# Script to make the avrodoc documentation. Run from the repository root:
# $ contrib/make_avrodoc.sh
# Depends on avrodoc already being on the PATH.
# Downloads the Avro command line tools jar itself if it is missing.

if [ -d contrib ]
then
    # Make sure we are in the contrib directory.
    cd contrib
fi

if [ ! -f avro-tools.jar ]
then

    # Download the Avro tools
    curl -o avro-tools.jar http://www.us.apache.org/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
fi

# Make a directory for all the .avpr files
mkdir -p ../target/schemas

# Make a place to put the documentation
mkdir -p ../target/documentation

for AVDL_FILE in ../src/main/resources/avro/*.avdl
do
    # Make each AVDL file into a JSON AVPR file.

    # Get the name of the AVDL file without its extension or path
    SCHEMA_NAME=$(basename "$AVDL_FILE" .avdl)

    # Decide what AVPR file it will become.
    AVPR_FILE="../target/schemas/${SCHEMA_NAME}.avpr"

    # Compile the AVDL to the AVPR
    java -jar avro-tools.jar idl "${AVDL_FILE}" "${AVPR_FILE}"

    # Use Avrodoc to make a per-API documentation file.
    HTML_FILE="../target/documentation/${SCHEMA_NAME}.html"
    avrodoc "${AVPR_FILE}" > "${HTML_FILE}"

done



26 changes: 26 additions & 0 deletions GeneratingDocumentation.md → doc/GeneratingDocumentation.md
@@ -52,3 +52,29 @@ mkdir -p target/documentation
avrodoc target/schemas/reads.avpr > target/documentation/reads.html

```

### Automating the process

Once you have installed Avrodoc, you can run the `contrib/make_avrodoc.sh` script to automate the above process. It compiles each `.avdl` file under `src/main/resources/avro` and writes the corresponding Avrodoc HTML file to `target/documentation`:

```shell
contrib/make_avrodoc.sh
```

## Generating UML Diagrams

There is a UML class diagram, `doc/uml.dia`, that describes the layout of the GA4GH data model.

UML class diagrams can be partially generated from the schemas by using [Doxygen](http://www.doxygen.org/) to generate XML, and then using [Dia](http://live.gnome.org/Dia) to generate UML classes from that XML output. Unfortunately, this only imports the Avro types; dependencies and layout still need to be added manually.

The `contrib` folder contains a `Doxyfile` and a rudimentary filter (`avdlDoxyFilter.py`) that can be used to generate Doxygen XML that Dia can import. To use them, run:

```shell
# Go into the contrib directory
cd contrib

# Run Doxygen, which will put XML docs in ../doc/doxygen/XML
doxygen
```

Then, open up Dia, do `File -> Open`, set the input file type to `Dox2UML (Multiple)`, and open `doc/doxygen/XML/index.xml`. Dia will generate UML classes for all the schema types, which you can lay out into a UML class diagram.
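
To give a feel for what the filter hands to Doxygen, here is a minimal sketch that applies the same record, union, and closing-brace rewrites as `avdlDoxyFilter.py` to a made-up AVDL fragment. The `rewrite_line` helper and the sample record are illustrative assumptions, not part of the repository, and the comment and protocol handling of the real filter is left out.

```python
import re

def rewrite_line(line):
    """Rewrite one AVDL line into C++-like syntax (simplified sketch)."""
    stripped = line.strip()
    if stripped.startswith("record"):
        # record Foo {  ->  struct Foo {
        name = re.search(r"record\s+(\S+)", line).group(1)
        return "struct {} {{".format(name)
    if stripped.startswith("union"):
        # union { a, b } x;  ->  union< a, b > x;
        match = re.search(r"union\s*\{(.*)\}(.*)", line)
        return "union<{}>{}".format(match.group(1), match.group(2))
    if stripped.endswith("}"):
        # A closing brace needs a trailing semicolon to look like C++.
        return "{};".format(line.rstrip())
    return line.rstrip()

# Hypothetical AVDL fragment, for illustration only.
sample = [
    "record Example {",
    "  union { null, string } name;",
    "}",
]
converted = [rewrite_line(line) for line in sample]
print("\n".join(converted))
```

The union becomes something Doxygen parses as a template type, which is why the union members end up inside angle brackets.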
66 changes: 66 additions & 0 deletions doc/GraphModeFAQ.md
@@ -0,0 +1,66 @@
# Graph Mode FAQ

This document holds frequently asked questions about the new graph mode, and how various tasks can be accomplished in graph mode and in classic mode.

If you have a relevant question, please add it to this document in a pull request.

## What does a SNP look like in graph versus classic mode?

In "classic" mode, a SNP is represented by a `Variant`, with `referenceBases` set to one base, and `alternateBases` set to the other.

In "graph" mode, a SNP exists as a single-base `Sequence` with the alternate base, joined with two `Join`s onto the `Sequence` with the original base, like this:

```
    -G-
   /   \
--A--C--T--G--C--A--
```

To express the genotype of this SNP, a variant caller will need to emit a pair of `Allele`s, one of which follows a single-base path through the original base, and one of which follows a single-base path through the alternate base. It would then emit `AlleleCall`s noting the copy number of each `Allele` in each `CallSet`.

The variant caller may additionally emit a `Variant` tying the two `Allele`s together, and giving genotypes in more traditional notation.
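
The graph-mode picture above can be sketched with a few toy data structures. The `Sequence` and `Join` tuples and the path representation below are simplified, hypothetical stand-ins, not the actual schema records, and the choice of which reference base the SNP replaces is arbitrary.

```python
from collections import namedtuple

# Toy stand-ins for the schema types; the field names are assumptions.
Sequence = namedtuple("Sequence", ["id", "bases"])
# A Join lets a walk continue from one position onto another position.
Join = namedtuple("Join", ["from_seq", "from_pos", "to_seq", "to_pos"])

reference = Sequence("ref", "ACTGCA")
alt = Sequence("alt", "G")  # single-base Sequence holding the alternate base

# The SNP replaces reference base 1 ("C"): base 0 -> alt, then alt -> base 2.
joins = [
    Join("ref", 0, "alt", 0),
    Join("alt", 0, "ref", 2),
]

sequences = {s.id: s for s in (reference, alt)}

# Each Allele is a single-base path, written as (sequence id, start, length).
ref_allele = [("ref", 1, 1)]
alt_allele = [("alt", 0, 1)]

def spell(path):
    """Return the bases spelled out by a path of (sequence, start, length)."""
    return "".join(sequences[seq].bases[start:start + length]
                   for seq, start, length in path)

# Copy number of each Allele in one hypothetical heterozygous CallSet.
allele_calls = {"ref_allele": 1, "alt_allele": 1}

print(spell(ref_allele), spell(alt_allele))
```

A homozygous alternate sample would instead carry a copy number of 2 for the alternate `Allele` and 0 for the reference one.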

## What does a short indel look like in graph versus classic mode?

In "classic" mode, an indel is represented by a `Variant`, with `referenceBases` set to "" (for an insertion) or some bases (for a deletion), and `alternateBases` set to the inserted bases (for an insertion) or "" (for a deletion).
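
As a concrete illustration of the classic-mode representation, the sketch below applies a `Variant`-like record to a reference string. The dictionary layout, the 0-based `start` field, and the use of a single string for `alternateBases` are assumptions made for this example; only the `referenceBases` and `alternateBases` names come from the description above.

```python
def apply_variant(reference, variant):
    """Replace referenceBases at start with alternateBases."""
    start = variant["start"]
    ref_bases = variant["referenceBases"]
    # The variant must match the reference it claims to modify.
    assert reference[start:start + len(ref_bases)] == ref_bases
    return (reference[:start] + variant["alternateBases"]
            + reference[start + len(ref_bases):])

reference = "ACTGCA"

# Insertion of "CA" before position 3: referenceBases is empty.
insertion = {"start": 3, "referenceBases": "", "alternateBases": "CA"}
# Deletion of "CTG" starting at position 1: alternateBases is empty.
deletion = {"start": 1, "referenceBases": "CTG", "alternateBases": ""}

inserted = apply_variant(reference, insertion)
deleted = apply_variant(reference, deletion)
print(inserted, deleted)
```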

In "graph" mode, an insertion exists as a `Sequence` with the inserted bases, joined onto the modified `Sequence` with `Join`s such that it connects the endpoints of the indel, like this:

```
Insertion:
            -C--A-
           / ____/
          / /
         ||
         /\
--A--C--T--G--C--A--
```

A deletion is represented by a single `Join` skipping the deleted bases, like this:

```
Deletion:
--A--C--T--G--C--A--
   \_________/
```

To express the genotype of an indel, a variant caller will need to emit a pair of `Allele`s, one of which follows the path with the extra bases, and one of which follows the 0-length path consisting of the adjacency broken by the insertion or created by the deletion. The caller would then emit `AlleleCall`s noting the copy number of each `Allele` in each `CallSet`.

The variant caller may additionally emit a `Variant` tying the two `Allele`s together, and giving genotypes in more traditional notation.

## How do I walk the graph to find all variants within 10kb of a specific position?

In "classic" mode, one can issue a `searchVariants()` call interrogating the range 10kb upstream and downstream of the position of interest. All `Variant`s overlapping that range would be returned.

In "graph" mode, the situation is more complicated. You want to perform a recursive search of the graph out to a distance of 10kb from your start position, following all possible paths.

You can use `searchJoins()` to find every `Sequence` attached to the `Sequence` containing your position of interest, restricted to joins that fall within the 10kb window around that position and that are oriented so you can read into the attached `Sequence` in the direction you are traversing the parent. For each such attached `Sequence` (retrieved with `getSequence()`), subtract the distance you walked to reach the join from your 10kb budget, work out how far in from the joined end the remainder lets you go, and recursively search that region for more children.

Once you have determined all the ranges on all the `Sequence`s that are "within 10kb" of your starting position, you can make a `searchAlleles()` call on each of them to get all `Allele` objects involving any bases within 10kb of your start position. If any are associated with `Variant` objects, you can use the `getVariant()` call to retrieve those `Variant`s by ID.

If you are only interested in `Variant` objects with reference `Allele`s overlapping your chosen ranges, you can use `searchVariants()` calls instead of `searchAlleles()` calls. This will ignore `Allele`s which are not part of `Variant`s, or which are not the reference `Allele`s for their `Variant`s.
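
The recursive search described above might look roughly like the following sketch. It works on a toy in-memory graph rather than calling `searchJoins()` and `getSequence()`, walks in the forward direction only, and ignores join orientation; the `lengths` and `joins` structures and the `reachable` helper are hypothetical names invented for the example.

```python
from collections import defaultdict

# Toy graph: sequence lengths and joins, all made up for this sketch.
lengths = {"ref": 100, "alt": 5}
# (from_seq, from_pos) -> (to_seq, to_pos): after reading from_pos on
# from_seq, a walk may continue forward at to_pos on to_seq.
joins = [
    (("ref", 19), ("alt", 0)),
    (("alt", 4), ("ref", 30)),
]

joins_from = defaultdict(list)
for (from_seq, from_pos), target in joins:
    joins_from[from_seq].append((from_pos, target))

def reachable(seq, pos, budget, found=None):
    """Collect half-open (start, end) ranges per sequence that lie within
    `budget` bases reading forward from (seq, pos), following joins."""
    if found is None:
        found = defaultdict(set)
    if budget <= 0:
        return found
    end = min(pos + budget, lengths[seq])
    if (pos, end) in found[seq]:
        return found  # this exact range was already explored
    found[seq].add((pos, end))
    for join_pos, (to_seq, to_pos) in joins_from[seq]:
        if pos <= join_pos < end:
            # Bases consumed walking from pos through join_pos, inclusive.
            used = join_pos - pos + 1
            reachable(to_seq, to_pos, budget - used, found)
    return found

# Ranges within 30 bases downstream of position 10 on "ref".
ranges = reachable("ref", 10, 30)
for name in sorted(ranges):
    print(name, sorted(ranges[name]))
```

Each resulting range is what you would then pass to a `searchAlleles()` call. A full implementation would also walk upstream, respect which side of a base each join attaches to, and merge overlapping ranges before querying.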


