Adding documentation

Adding in Doxygen capabilities for generating UML.

Adding UML diagram.

Specifying how the UML diagram can be generated.

Automating Avrodoc build with a script.

Adding proper Beacon stuff to the UML

Updating UML to drop AlleleResource

Adding a Graph Mode FAQ

It would be good to have the answers to people's questions about graph mode all in one place.

Moving and renaming documentation

All the extra Markdowns should go in doc/, and should not have spaces in the filenames.

Adding an SVG of the UML to the repo.

Make the FAQ make sense with the side graph changes.
ga4gh · Apr 3, 2015 · b568aa1
1 parent 1cbf5ba

Showing 7 changed files with 25,615 additions and 0 deletions.
diff --git a/contrib/Doxyfile b/contrib/Doxyfile
diff --git a/contrib/avdlDoxyFilter.py b/contrib/avdlDoxyFilter.py
@@ -0,0 +1,124 @@
+#!/usr/bin/env python2.7
+"""
+avdlDoxyFilter.py: hack Avro IDL files into vaguely C++-like files that Doxygen
+can read.
+
+Re-uses sample code and documentation from
+<http://users.soe.ucsc.edu/~karplus/bme205/f12/Scaffold.html>
+"""
+
+import argparse, sys, os, itertools, re
+
+def parse_args(args):
+    """
+    Takes in the command-line arguments list (args), and returns a nice argparse
+    result with fields for all the options.
+    Borrows heavily from the argparse documentation examples:
+    <http://docs.python.org/library/argparse.html>
+    """
+
+    # The command line arguments start with the program name, which we don't
+    # want to treat as an argument for argparse. So we remove it.
+    args = args[1:]
+
+    # Construct the parser (which is stored in parser)
+    # Module docstring lives in __doc__
+    # See http://python-forum.com/pythonforum/viewtopic.php?f=3&t=36847
+    # And a formatter class so our examples in the docstring look good. Isn't it
+    # convenient how we already wrapped it to 80 characters?
+    # See http://docs.python.org/library/argparse.html#formatter-class
+    parser = argparse.ArgumentParser(description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter)
+
+    # Now add all the options to it
+    parser.add_argument("avdl", type=argparse.FileType('r'),
+        help="the AVDL file to read")
+
+    return parser.parse_args(args)
+
+
+def main(args):
+    """
+    Parses command line arguments, and does the work of the program.
+    "args" specifies the program arguments, with args[0] being the executable
+    name. The return value should be used as the program's exit code.
+    """
+
+    options = parse_args(args) # This holds the nicely-parsed options object
+
+    # Are we in a comment?
+    in_comment = False
+
+    # What level of braces are we in?
+    brace_level = 0
+
+    for line in options.avdl:
+        # For every line of Avro
+
+        # See if it's a comment start or end.
+        comment_starter = line.rfind("/*")
+        comment_ender = line.rfind("*/")
+
+        if(comment_starter != -1 and (comment_ender == -1 or
+            comment_ender < comment_starter)):
+            # We have entered a multiline comment
+
+            in_comment = True
+        elif comment_ender != -1:
+            # We have ended a multiline comment and not started another one.
+            in_comment = False
+
+        if in_comment:
+            # Just pass comments as-is
+            print(line.rstrip())
+            continue
+
+        # How many unbalanced braces do we have outside comments?
+        brace_change = line.count("{") - line.count("}")
+
+        if line.lstrip().startswith("protocol"):
+            # It's a protocol, so make it a namespace.
+
+            # Grab the protocol name
+            name = re.search(r'protocol\s+(\S+)', line).group(1)
+
+            # Make the open lines
+            print("namespace {} {{".format(name))
+            #print("interface {} {{".format(name))
+
+        elif line.lstrip().startswith("record"):
+            # It's a record, so make it a Struct.
+
+            # Grab the record name
+            name = re.search(r'record\s+(\S+)', line).group(1)
+
+            print("struct {} {{".format(name))
+
+        elif line.lstrip().startswith("union"):
+            # We need to rewrite the union so that Doxygen can parse it.
+
+            # Parse out the union
+            match = re.search(r"union\s*{(.*)}(.*)", line)
+
+            # What got unioned?
+            unioned = match.group(1)
+
+            # What's the rest of the line?
+            rest = match.group(2)
+
+            # Make the union a template as far as Doxygen knows.
+            print("union<{}>{}".format(unioned, rest))
+
+
+        elif line.rstrip().endswith("}"):
+            # The line is closing something, so it needs a semicolon.
+            print("{};".format(line.rstrip()))
+        else:
+            # Pass other lines
+            print(line.rstrip())
+
+        # Change the brace level.
+        brace_level += brace_change
+
+if __name__ == "__main__" :
+    sys.exit(main(sys.argv))
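For a quick sense of what the filter emits, the per-line rules above can be condensed into a small standalone sketch. This is only an illustration: `convert_line` is a hypothetical helper that is not part of the commit, the AVDL sample is made up, and the multiline comment handling is left out.

```python
import re


def convert_line(line):
    """Apply the filter's per-line rules to one non-comment AVDL line."""
    stripped = line.rstrip()
    body = stripped.lstrip()
    if body.startswith("protocol"):
        # Protocols become C++ namespaces.
        name = re.search(r"protocol\s+(\S+)", body).group(1)
        return "namespace {} {{".format(name)
    if body.startswith("record"):
        # Records become structs.
        name = re.search(r"record\s+(\S+)", body).group(1)
        return "struct {} {{".format(name)
    if body.startswith("union"):
        # Unions are rewritten to look like a template type.
        match = re.search(r"union\s*\{(.*)\}(.*)", body)
        return "union<{}>{}".format(match.group(1), match.group(2))
    if stripped.endswith("}"):
        # Closing braces need a trailing semicolon in C++.
        return stripped + ";"
    return stripped


# Made-up AVDL fragment, for illustration only.
sample = [
    "protocol Reads {",
    "  record Read {",
    "    union { null, string } name = null;",
    "  }",
    "}",
]
converted = [convert_line(line) for line in sample]
for line in converted:
    print(line)
```

On this sample, the protocol and record openers become `namespace` and `struct` lines (losing their indentation, as in the filter itself), and each closing brace gains a semicolon so that Doxygen's C++ parser accepts it.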
diff --git a/contrib/make_avrodoc.sh b/contrib/make_avrodoc.sh
@@ -0,0 +1,47 @@
+#!/usr/bin/env bash
+
+# Script to make the avrodoc documentation. Run from the repository root:
+# $ contrib/make_avrodoc.sh
+# Depends on avrodoc and java already being on the PATH.
+# Downloads the Avro command line tools jar itself if it is not present.
+
+if [ -d contrib ]
+then
+    # Make sure we are in the contrib directory.
+    cd contrib
+fi
+
+if [ ! -f avro-tools.jar ]
+then
+
+    # Download the Avro tools
+    curl -o avro-tools.jar http://www.us.apache.org/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
+fi
+
+# Make a directory for all the .avpr files
+mkdir -p ../target/schemas
+
+# Make a place to put the documentation
+mkdir -p ../target/documentation
+
+for AVDL_FILE in ../src/main/resources/avro/*.avdl
+do
+    # Make each AVDL file into a JSON AVPR file.
+
+    # Get the name of the AVDL file without its extension or path
+    SCHEMA_NAME=$(basename "$AVDL_FILE" .avdl)
+
+    # Decide what AVPR file it will become.
+    AVPR_FILE="../target/schemas/${SCHEMA_NAME}.avpr"
+
+    # Compile the AVDL to the AVPR
+    java -jar avro-tools.jar idl "${AVDL_FILE}" "${AVPR_FILE}"
+
+    # Use Avrodoc to make a per-API documentation file.
+    HTML_FILE="../target/documentation/${SCHEMA_NAME}.html"
+    avrodoc "${AVPR_FILE}" > "${HTML_FILE}"
+
+done
+
+
+
diff --git a/GeneratingDocumentation.md → doc/GeneratingDocumentation.md b/GeneratingDocumentation.md → doc/GeneratingDocumentation.md
@@ -52,3 +52,29 @@ mkdir -p target/documentation
 avrodoc target/schemas/reads.avpr > target/documentation/reads.html
 
 ```
+
+### Automating the process
+
+Once you have installed Avrodoc, you can run the `contrib/make_avrodoc.sh` script to automate the above process. It builds one Avrodoc HTML file in `target/documentation` for each `.avdl` schema:
+
+```shell
+contrib/make_avrodoc.sh
+```
+
+## Generating UML Diagrams
+
+There is a UML class diagram, `doc/uml.dia`, that describes the layout of the GA4GH data model.
+
+UML class diagrams can be partially generated from the schemas by using [Doxygen](http://www.doxygen.org/) to generate XML, and then using [Dia](http://live.gnome.org/Dia) to generate UML classes from that XML output. Unfortunately, this only imports the Avro types; dependencies and layout still need to be done manually.
+
+The `contrib` folder contains a `Doxyfile` and a rudimentary filter (`avdlDoxyFilter.py`) that can be used to generate Doxygen XML that Dia can import. To use them, run:
+
+```shell
+# Go into the contrib directory
+cd contrib
+
+# run Doxygen, which will put XML docs in ../doc/doxygen/XML
+doxygen
+```
+
+Then, open up Dia, do `File -> Open`, set the input file type to `Dox2UML (Multiple)`, and open `doc/doxygen/XML/index.xml`. Dia will generate UML classes for all the schema types, which you can lay out into a UML class diagram.
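Before opening Dia, it can help to check that Doxygen actually picked up the schema types. The following is a minimal sketch, assuming the usual Doxygen XML index layout (a `doxygenindex` root whose `compound` children carry a `kind` attribute and a `name` element). `list_compounds` is a hypothetical helper, not part of this repository, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def list_compounds(index_xml_text):
    """Return sorted (kind, name) pairs for every compound in a Doxygen index."""
    root = ET.fromstring(index_xml_text)
    return sorted(
        (compound.get("kind"), compound.findtext("name", default=""))
        for compound in root.iter("compound")
    )


# Made-up stand-in for the contents of doc/doxygen/XML/index.xml.
SAMPLE_INDEX = """
<doxygenindex>
  <compound refid="structReads_1_1Read" kind="struct"><name>Reads::Read</name>
    <member refid="structReads_1_1Read_1a0" kind="variable"><name>id</name></member>
  </compound>
  <compound refid="namespaceReads" kind="namespace"><name>Reads</name></compound>
</doxygenindex>
"""

compounds = list_compounds(SAMPLE_INDEX)
for kind, name in compounds:
    print(kind, name)
```

To run it against real output, read `doc/doxygen/XML/index.xml` into a string and pass that to `list_compounds` instead of the sample.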
diff --git a/doc/GraphModeFAQ.md b/doc/GraphModeFAQ.md
@@ -0,0 +1,66 @@
+# Graph Mode FAQ
+
+This document holds frequently asked questions about the new graph mode, and how various tasks can be accomplished in graph mode and in classic mode.
+
+If you have a relevant question, please add it to this document in a pull request.
+
+## What does a SNP look like in graph versus classic mode?
+
+In "classic" mode, a SNP is represented by a `Variant`, with `referenceBases` set to one base, and `alternateBases` set to the other.
+
+In "graph" mode, a SNP exists as a single-base `Sequence` with the alternate base, joined with two `Join`s onto the `Sequence` with the original base, like this:
+
+```
+       -G-
+      /   \
+--A--C--T--G--C--A--
+```
+
+To express the genotype of this SNP, a variant caller will need to emit a pair of `Allele`s, one of which follows a single-base path through the original base, and one of which follows a single-base path through the alternate base. It would then emit `AlleleCall`s noting the copy number of each `Allele` in each `CallSet`.
+
+The variant caller may additionally emit a `Variant` tying the two `Allele`s together, and giving genotypes in more traditional notation.
+
+## What does a short indel look like in graph versus classic mode?
+
+In "classic" mode, an indel is represented by a `Variant`, with `referenceBases` set to "" (for an insertion) or some bases (for a deletion), and `alternateBases` set to the inserted bases (for an insertion) or "" (for a deletion).
+
+In "graph" mode, an insertion exists as a `Sequence` with the inserted bases, joined onto the modified `Sequence` with `Join`s such that it connects the endpoints of the indel, like this:
+
+```
+Insertion:
+
+        -C--A-
+       / ____/
+      / /
+      ||
+      /\
+--A--C--T--G--C--A--
+```
+
+A deletion is represented by a single `Join` skipping the deleted bases, like this:
+
+```
+Deletion:
+
+--A--C--T--G--C--A--
+   \_________/
+```
+
+To express the genotype of an indel, a variant caller will need to emit a pair of `Allele`s, one of which follows the path with the extra bases, and one of which follows the 0-length path consisting of the adjacency broken by the insertion or created by the deletion. The caller would then emit `AlleleCall`s noting the copy number of each `Allele` in each `CallSet`.
+
+The variant caller may additionally emit a `Variant` tying the two `Allele`s together, and giving genotypes in more traditional notation.
+
+## How do I walk the graph to find all variants within 10kbp of a specific position?
+
+In "classic" mode, one can issue a `searchVariants()` call interrogating the range 10kb upstream and downstream of the position of interest. All `Variant`s overlapping that range would be returned.
+
+In "graph" mode, the situation is more complicated. You want to perform a recursive search of the graph out to a distance of 10kb from your start position, following all possible paths.
+
+You can use `searchJoins()` to get information about all the `Sequence`s that are attached to the `Sequence` containing your position of interest, that are attached within a 10kb window around that position, and that are attached such that it is possible to read into them in the direction you are traversing the parent. You would then have to recurse into each such attached `Sequence` (retrieved with `getSequence()`): work out how far in from the joined end you can get with whatever is left of your 10kb window after walking out to the join, and recursively search that region for more children.
+
+Once you have determined all the ranges on all the `Sequence`s that are "within 10kb" of your starting position, you can make a `searchAlleles()` call on each of them to get all `Allele` objects involving any bases within 10kb of your start position. If any are associated with `Variant` objects, you can use the `getVariant()` call to retrieve those `Variant`s by ID.
+
+If you are only interested in `Variant` objects with reference `Allele`s overlapping your chosen ranges, you can use `searchVariants()` calls instead of `searchAlleles()` calls. This will ignore `Allele`s which are not part of `Variant`s, or which are not the reference `Allele`s for their `Variant`s.
+
+
+
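The recursive search described in the last FAQ answer can be sketched on a toy in-memory graph. This is a simplified illustration rather than client code for the API: the `Join` tuple, `reachable_ranges`, and the graph encoding are all hypothetical, the walk only goes in one direction, and strand and join sides are ignored. In a real client, the joins would presumably come from `searchJoins()`, and each resulting range would be passed to `searchAlleles()`.

```python
from collections import defaultdict, namedtuple

# Hypothetical join: after reading base from_pos on from_seq, a walk may
# continue at base to_pos on to_seq.
Join = namedtuple("Join", ["from_seq", "from_pos", "to_seq", "to_pos"])


def reachable_ranges(lengths, joins, start_seq, start_pos, budget):
    """Return {sequence id: [(start, end), ...]} of half-open ranges lying
    within `budget` bases downstream of the start position."""
    joins_by_seq = defaultdict(list)
    for join in joins:
        joins_by_seq[join.from_seq].append(join)

    best = {}                  # (seq, pos) -> largest budget seen on entry
    ranges = defaultdict(list)
    todo = [(start_seq, start_pos, budget)]
    while todo:
        seq, pos, remaining = todo.pop()
        if best.get((seq, pos), -1) >= remaining:
            continue           # already explored from here with more budget
        best[(seq, pos)] = remaining
        # Walk along this sequence as far as the remaining budget allows.
        end = min(lengths[seq], pos + remaining + 1)
        ranges[seq].append((pos, end))
        for join in joins_by_seq[seq]:
            if pos <= join.from_pos < end:
                # Reaching the join and crossing it both use up budget.
                left = remaining - (join.from_pos - pos) - 1
                if left >= 0:
                    todo.append((join.to_seq, join.to_pos, left))
    return dict(ranges)


# The SNP graph from the first FAQ answer: reference ACTGCA with an
# alternate G replacing the T at index 2.
lengths = {"ref": 6, "alt": 1}
joins = [Join("ref", 1, "alt", 0), Join("alt", 0, "ref", 3)]
ranges = reachable_ranges(lengths, joins, "ref", 0, 3)
print(ranges)
```

The returned ranges may overlap when a position is reachable along more than one path, so a fuller implementation would probably merge them per sequence before issuing any queries.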