"Match" table, tree, and sample metadata, and verify that things seem…

… ok (#154)

* BUG/TST: Add back in data matching/checking code Closes #139, for real this time. Eventually we'll need to check that feature metadata matches up, but that is its own problem for later down the road.
* STY: fix flake8 complaint
* DOC: add all needed moving pix files & "make docs" Not sure why these files weren't here before, but this will make rerunning the tutorial easy. Also "make docs" is just a shorthand that saves extra typing when re-visualizing the moving pictures tree. We could integrate this into the travis build in the future if desired (of course this would be predicated on us getting QIIME 2 set up in the travis build, which would add on a few minutes to each build due to Q2 installation taking some time).
* DOC: typo fix [ci skip]
* BUG: Transpose feature tbl before matching it So apparently QIIME 2's transformers from biom table -> pd DataFrame produce DFs that are transposed from what biom.Table does -- QIIME 2 uses samples as the indices (rows) and features as the columns, while biom.Table does it the other way around. As you can imagine, this is pretty confusing! This commit should fix this problem from our end, but in the future we should really add logic to prevent having to do table-DF-transposition, since IIRC that can be super slow with massive DFs. (...We really oughta unit-test _plot.)
* STY: rm extra blank line
* TST: rename a prev matching test and add skeletons
* TST: Add "no features shared" test for matching part of #139 fixes
* TST: test a warning msg printed during matching
* TST: Add sample dropping warning test think this pr should be good for now
* DOC: add note to match_inputs() re #130 (TODO)
* TST: Install and use QIIME 2 env in travis build
* TST: Add actual Q2 integration test! Addresses @ElDeveloper's comment on #154. I'm keeping 'make docs' around since it could still be nifty (if you just wanna regenerate the empress-tree.qzv file without rerunning the tests, I guess).
* TST: don't run 'make docs' on travis build Since the Q2 Artifact API test I just added does the same thing.
* TST: Add rough Q2 visualization check #154 Addresses comment from @ElDeveloper
* STY: Remove blank lines in match_inputs() Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>
* STY: more blank line removals in docstring Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>
* STY: rm blank lines in print_if_dropped docstring Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>
* MNT: warn instead of printing re: sample dropping Tests haven't been updated yet -- will do so when --ignore-missing-samples option added in. (So this will currently break the tests.) This represents part of the work on addressing @ElDeveloper's comments on #154.
* ENH: add UI skeleton for no-data sample/feat flags Per suggestion from @ElDeveloper in #154
* STY: make _plot inputs prettier
* DOC: add ref to emperor --ignore-missing-samples
* DOC: Remove 'standalone' instructions in README Just for now. When we resolve #140, we should add these instructions back in (likely we'll also have to adjust these when we get to the 'initial release' of Empress on PyPI / conda-forge / etc.)
* DOC: switch feature/sample flag order, imprv docs
* ENH: Add @ElDeveloper's suggested filtering flags This entailed substantial restructuring of match_inputs(). I also completely deleted warn_if_dropped(), because it was honestly easier to replace it with custom error messages for each of its 3 usages. (Also, that thing was like 50 lines of docstring / infrastructure for 8 lines of code. It was gnarly. :P) This isn't done yet! I still need to test this new behavior thoroughly, and to update the tests for the old functionality accordingly.
* MNT: Avoid redundant table DF transpositions #155
* BUG: don't display useless warning in most cases
* TST: reduce tests to just one working one will add more back (with relevant changes to work with new behavior) soon
* DOC: add TODO note re empty checking
* TST: add back "simple" matching error tests
* TST: add + beef up tests of matching warnings, etc
* TST: add --p-ignore-missing-samples tests
* TST: add another cornercase test
* TST: test final "warning" in matching func for now also fixed a bug in prev test i just added in, and removed extraneous comment
* TST: Add other check for extra s.m. sample warning I think I'm satisfied with the new matching behavior tests, at least for now
* DOC: update example QZV :)
* MNT: don't warn on dropped samples from s.metadata See new comment for justification. Addresses comment from @ElDeveloper.

Co-authored-by: Yoshiki Vázquez Baeza <[email protected]>
biocore · Apr 20, 2020 · 2bd92b3
1 parent cb25912
Showing 14 changed files with 581 additions and 53 deletions.
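
The matching behavior described in the commit message and in the new parameter descriptions can be sketched roughly as follows. This is a minimal illustration, not the code in `tools.match_inputs()`: the function name `match_sketch`, its signature, and the error messages are assumptions made for this example. Only the two flag names and the placeholder string "This sample has no metadata" come from the commit itself.

```python
import pandas as pd

# Placeholder text quoted from the ignore_missing_samples description.
NO_METADATA = "This sample has no metadata"


def match_sketch(tip_names, table, sample_metadata,
                 ignore_missing_samples=False,
                 filter_missing_features=False):
    """Hypothetical stand-in for the matching step (not Empress' real code).

    table: features as rows, samples as columns.
    sample_metadata: samples as rows, metadata fields as columns.
    """
    tip_names = set(tip_names)

    # Features: every feature in the table should be a tip in the tree.
    shared_features = [f for f in table.index if f in tip_names]
    if not shared_features:
        # The flag only applies if at least one feature is shared.
        raise ValueError("No features in the table are tips in the tree.")
    if len(shared_features) < len(table.index):
        if not filter_missing_features:
            raise ValueError("Some table features are not tips in the tree.")
        table = table.loc[shared_features]

    # Samples: every sample in the table should have metadata.
    shared = [s for s in table.columns if s in sample_metadata.index]
    missing = [s for s in table.columns if s not in sample_metadata.index]
    if not shared:
        # The flag only applies if at least one sample is shared.
        raise ValueError("No samples in the table have metadata.")
    if missing:
        if not ignore_missing_samples:
            raise ValueError("Some table samples have no metadata.")
        filler = pd.DataFrame(
            {col: [NO_METADATA] * len(missing)
             for col in sample_metadata.columns},
            index=missing,
        )
        sample_metadata = pd.concat([sample_metadata, filler])

    # Samples that appear only in the metadata are dropped without a warning.
    sample_metadata = sample_metadata.loc[table.columns]
    return table, sample_metadata
```

With both flags left at `False`, any mismatch in the table raises an error; samples that exist only in the metadata are dropped quietly in either case.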
diff --git a/.gitignore b/.gitignore
@@ -1,9 +1,4 @@
-# trees/metadata used for development
 data/
-*.qza
-*.qzv
-!docs/*/*.qzv
-*.tsv
 empress-biom.py
 
 # NodeJS file used for chrome headless QUnit tests

diff --git a/.travis.yml b/.travis.yml
@@ -13,13 +13,15 @@ before_install:
   - export PATH=/home/travis/miniconda2/bin:$PATH
   # Update conda itself
   - conda update --yes conda
+  # Install the latest QIIME 2 version. These lines copied from
+  # https://github.com/biocore/qurro/blob/d71f55b82f427c0d7d3db80bfd629e2ae1b6a335/.travis.yml.
+  - wget https://raw.githubusercontent.com/qiime2/environment-files/master/latest/staging/qiime2-latest-py36-linux-conda.yml
+  - travis_retry conda env create -n qiime2-dev --file qiime2-latest-py36-linux-conda.yml
+  - source activate qiime2-dev
 install:
+  - pip install -e .[all] --verbose
   - npm install -g qunit-puppeteer jshint prettier
-  # install requests using conda to avoid a distutils error in certifi
-  - conda create --yes -n travis python=3.6 pip numpy scipy matplotlib pandas flake8 pep8  cython nose
-  - source activate travis
-  - conda install -c bioconda scikit-bio biom-format --yes
-  - pip install -e . --verbose
 script:
+  - qiime dev refresh-cache
   - make test
   - make stylecheck
diff --git a/Makefile b/Makefile
@@ -6,7 +6,7 @@
 # Requires that a few command-line utilities are installed; see the Travis-CI
 # config file (.travis.yml) for examples of installing these utilities.
 
-.PHONY: test pytest jstest stylecheck jsstyle githook
+.PHONY: test pytest jstest stylecheck jsstyle githook docs
 
 JSLOCS = empress/support_files/js/*.js
 
@@ -38,3 +38,13 @@ githook:
 	@# https://www.viget.com/articles/two-ways-to-share-git-hooks-with-your-team/
 	echo "#!/bin/bash\nmake stylecheck" > .git/hooks/pre-commit
 	chmod +x .git/hooks/pre-commit
+
+docs:
+	@# For now, this just regenerates the moving pictures QZV
+	@# Assumes you're in a QIIME 2 conda environment
+	qiime empress plot \
+		--i-tree docs/moving-pictures/rooted-tree.qza \
+		--i-feature-table docs/moving-pictures/table.qza \
+		--m-sample-metadata-file docs/moving-pictures/sample_metadata.tsv \
+		--m-feature-metadata-file docs/moving-pictures/taxonomy.qza \
+		--o-visualization docs/moving-pictures/empress-tree.qzv
diff --git a/README.md b/README.md
@@ -5,20 +5,6 @@ Empress is a fast and scalable phylogenetic tree viewer.
 
 ## Installation
 
-### Installing Empress "Standalone"
-
-To install the current development version, we recommend creating a new conda
-environment:
-
-```bash
-conda create -n empress python=3.6 numpy scipy pandas cython
-conda activate empress
-conda install -c bioconda scikit-bio biom-format
-pip install git+https://github.com/biocore/empress.git
-```
-
-### Installing Empress through QIIME 2
-
 Before following these instructions, make sure your QIIME 2 conda environment
 is activated (a version of at least 2019.1 is required). Then, run the
 following commands:

diff --git a/docs/moving-pictures/empress-tree.qzv b/docs/moving-pictures/empress-tree.qzv
diff --git a/docs/moving-pictures/rooted-tree.qza b/docs/moving-pictures/rooted-tree.qza
diff --git a/docs/moving-pictures/sample_metadata.tsv b/docs/moving-pictures/sample_metadata.tsv
@@ -0,0 +1,36 @@
+sample-id	barcode-sequence	body-site	year	month	day	subject	reported-antibiotic-usage	days-since-experiment-start
+#q2:types	categorical	categorical	numeric	numeric	numeric	categorical	categorical	numeric
+L1S8	AGCTGACTAGTC	gut	2008	10	28	subject-1	Yes	0
+L1S57	ACACACTATGGC	gut	2009	1	20	subject-1	No	84
+L1S76	ACTACGTGTGGT	gut	2009	2	17	subject-1	No	112
+L1S105	AGTGCGATGCGT	gut	2009	3	17	subject-1	No	140
+L2S155	ACGATGCGACCA	left palm	2009	1	20	subject-1	No	84
+L2S175	AGCTATCCACGA	left palm	2009	2	17	subject-1	No	112
+L2S204	ATGCAGCTCAGT	left palm	2009	3	17	subject-1	No	140
+L2S222	CACGTGACATGT	left palm	2009	4	14	subject-1	No	168
+L3S242	ACAGTTGCGCGA	right palm	2008	10	28	subject-1	Yes	0
+L3S294	CACGACAGGCTA	right palm	2009	1	20	subject-1	No	84
+L3S313	AGTGTCACGGTG	right palm	2009	2	17	subject-1	No	112
+L3S341	CAAGTGAGAGAG	right palm	2009	3	17	subject-1	No	140
+L3S360	CATCGTATCAAC	right palm	2009	4	14	subject-1	No	168
+L5S104	CAGTGTCAGGAC	tongue	2008	10	28	subject-1	Yes	0
+L5S155	ATCTTAGACTGC	tongue	2009	1	20	subject-1	No	84
+L5S174	CAGACATTGCGT	tongue	2009	2	17	subject-1	No	112
+L5S203	CGATGCACCAGA	tongue	2009	3	17	subject-1	No	140
+L5S222	CTAGAGACTCTT	tongue	2009	4	14	subject-1	No	168
+L1S140	ATGGCAGCTCTA	gut	2008	10	28	subject-2	Yes	0
+L1S208	CTGAGATACGCG	gut	2009	1	20	subject-2	No	84
+L1S257	CCGACTGAGATG	gut	2009	3	17	subject-2	No	140
+L1S281	CCTCTCGTGATC	gut	2009	4	14	subject-2	No	168
+L2S240	CATATCGCAGTT	left palm	2008	10	28	subject-2	Yes	0
+L2S309	CGTGCATTATCA	left palm	2009	1	20	subject-2	No	84
+L2S357	CTAACGCAGTCA	left palm	2009	3	17	subject-2	No	140
+L2S382	CTCAATGACTCA	left palm	2009	4	14	subject-2	No	168
+L3S378	ATCGATCTGTGG	right palm	2008	10	28	subject-2	Yes	0
+L4S63	CTCGTGGAGTAG	right palm	2009	1	20	subject-2	No	84
+L4S112	GCGTTACACACA	right palm	2009	3	17	subject-2	No	140
+L4S137	GAACTGTATCTC	right palm	2009	4	14	subject-2	No	168
+L5S240	CTGGACTCATAG	tongue	2008	10	28	subject-2	Yes	0
+L6S20	GAGGCTCATCAT	tongue	2009	1	20	subject-2	No	84
+L6S68	GATACGTCCTGA	tongue	2009	3	17	subject-2	No	140
+L6S93	GATTAGCACTCT	tongue	2009	4	14	subject-2	No	168
diff --git a/docs/moving-pictures/table.qza b/docs/moving-pictures/table.qza
diff --git a/docs/moving-pictures/taxonomy.qza b/docs/moving-pictures/taxonomy.qza
diff --git a/empress/_plot.py b/empress/_plot.py
@@ -24,29 +24,56 @@
 TEMPLATES = os.path.join(SUPPORT_FILES, 'templates')
 
 
-def plot(output_dir: str,
-         tree: NewickFormat,
-         feature_table: pd.DataFrame,
-         sample_metadata: qiime2.Metadata,
-         feature_metadata: qiime2.Metadata = None) -> None:
+def plot(
+    output_dir: str,
+    tree: NewickFormat,
+    feature_table: pd.DataFrame,
+    sample_metadata: qiime2.Metadata,
+    feature_metadata: qiime2.Metadata = None,
+    ignore_missing_samples: bool = False,
+    filter_missing_features: bool = False
+) -> None:
+
+    # 1. Convert inputs to the formats we want
 
     # TODO: do not ignore the feature metadata when specified by the user
     if feature_metadata is not None:
         feature_metadata = feature_metadata.to_dataframe()
 
+    sample_metadata = sample_metadata.to_dataframe()
+
     # create/parse tree
     tree_file = str(tree)
     # path to the actual newick file
     with open(tree_file) as file:
         t = parse_newick(file.readline())
+    empress_tree = Tree.from_tree(to_skbio_treenode(t))
+    tools.name_internal_nodes(empress_tree)
+
+    # 2. Now that we've converted/read/etc. all of the four input sources,
+    # ensure that the samples and features they describe "match up" sanely.
+
+    # Note that the feature_table we get from QIIME 2 (as an argument to this
+    # function) is set up such that the index describes sample IDs and the
+    # columns describe feature IDs. We transpose this table before sending it
+    # to tools.match_inputs() and keep using the transposed table for the rest
+    # of this visualizer.
+
+    feature_table, sample_metadata = tools.match_inputs(
+        empress_tree, feature_table.T, sample_metadata, feature_metadata,
+        ignore_missing_samples, filter_missing_features
+    )
+
+    # TODO: Add a check for empty samples/features in the table? Filtering this
+    # sorta stuff out would help speed things up (and would be good to report
+    # to the user via warnings).
+
+    # 3. Go forward with creating the Empress visualization!
 
     # extract balance parenthesis
     bp_tree = list(t.B)
 
-    # calculate tree coordinates
-    empress_tree = Tree.from_tree(to_skbio_treenode(t))
-    tools.name_internal_nodes(empress_tree)
-
+    # Compute coordinates resulting from layout algorithm(s)
     # TODO: figure out implications of screen size
     layout_to_coordsuffix, default_layout = empress_tree.coords(4020, 4020)
 
@@ -83,30 +110,24 @@ def plot(output_dir: str,
     env = Environment(loader=FileSystemLoader(TEMPLATES))
     temp = env.get_template('empress-template.html')
 
-    # sample metadata
-    sample_data = sample_metadata \
-        .to_dataframe().filter(feature_table.index, axis=0) \
-        .to_dict(orient='index')
+    # Convert sample metadata to a JSON-esque format
+    sample_data = sample_metadata.to_dict(orient='index')
 
     # TODO: Empress is currently storing all metadata as strings. This is
-    # memory intensive and wont scale well. We should convert all numeric
+    # memory intensive and won't scale well. We should convert all numeric
     # data/compress metadata.
 
     # This is used in biom-table. Currently this is only used to ignore null
-    # data (i.e. NaN and "unknown") and also determines sorting order.
-    # The original intent is to signal what
-    # columns are discrete/continous.
+    # data (i.e. NaN and "unknown") and also determines sorting order. The
+    # original intent is to signal what columns are discrete/continuous.
     # type of sample metadata (n - number, o - object)
-    sample_data_type = sample_metadata \
-        .to_dataframe().filter(feature_table.index, axis=0) \
-        .dtypes \
-        .to_dict()
+    sample_data_type = sample_metadata.dtypes.to_dict()
     sample_data_type = {k: 'n' if pd.api.types.is_numeric_dtype(v) else 'o'
                         for k, v in sample_data_type.items()}
 
     # create a mapping of observation ids and the samples that contain them
     obs_data = {}
-    feature_table = (feature_table > 0).T
+    feature_table = (feature_table > 0)
     for _, series in feature_table.iteritems():
         sample_ids = series[series].index.tolist()
         obs_data[series.name] = sample_ids
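
The orientation issue behind the `feature_table.T` call, and the loop that builds `obs_data`, can be illustrated with a toy table. This is a sketch under assumptions: the sample and feature IDs are made up, and `items()` is used in place of `iteritems()`, which was removed in pandas 2.0.

```python
import pandas as pd

# Toy table in the orientation the comment in plot() attributes to QIIME 2:
# samples as rows (index), features as columns. All IDs are made up.
q2_style = pd.DataFrame(
    [[1, 0, 2],
     [0, 3, 2]],
    index=["s1", "s2"],
    columns=["a", "b", "c"],
)

# Transposing gives features as rows and samples as columns, the
# orientation the rest of plot() works with after this commit.
table = q2_style.T

# Same loop shape as in plot(), with items() instead of iteritems().
presence = table > 0
obs_data = {}
for _, series in presence.items():
    # series is one boolean column; keep the row labels where it is True.
    obs_data[series.name] = series[series].index.tolist()

print(obs_data)  # {'s1': ['a', 'c'], 's2': ['b', 'c']}
```

With features as rows, each key is a column label (a sample ID here) and each value lists the row labels (feature IDs) that have a nonzero count in that column.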

diff --git a/empress/plugin_setup.py b/empress/plugin_setup.py
@@ -9,7 +9,7 @@
 
 from ._plot import plot
 
-from qiime2.plugin import Plugin, Metadata, Citations
+from qiime2.plugin import Plugin, Metadata, Bool, Citations
 from q2_types.tree import Phylogeny, Rooted
 from q2_types.feature_table import FeatureTable, Frequency
 
@@ -36,25 +36,49 @@
     },
     parameters={
         'sample_metadata': Metadata,
-        'feature_metadata': Metadata
+        'feature_metadata': Metadata,
+        'ignore_missing_samples': Bool,
+        'filter_missing_features': Bool
     },
     input_descriptions={
         'tree': 'The phylogenetic tree to visualize.',
         'feature_table': (
-            'The feature table relating samples to features in the tree. '
+            'A table containing the abundances of features within samples. '
             'This information allows us to decorate the phylogeny by '
-            'sample metadata.'
+            "sample metadata. It's expected that all features in the table "
+            'are also present as tips in the tree, and that all samples in '
+            'the table are also present in the sample metadata file.'
         )
     },
     parameter_descriptions={
         'sample_metadata': (
             'Sample metadata. Can be used to color tips in the tree by '
-            'the samples they are unique to.'
+            'the samples they are unique to. Samples described in the '
+            'metadata that are not present in the feature table will '
+            'be automatically filtered out of the visualization.'
         ),
         'feature_metadata': (
             'Feature metadata. Not currently used for anything, but will '
             'be soon.'
-        )
+        ),
+        # Parameter descriptions adapted from q2-emperor's
+        # --p-ignore-missing-samples flag.
+        'ignore_missing_samples': (
+            'This will suppress the error raised when the feature table '
+            'contains samples that are not present in the sample metadata. '
+            'Samples without metadata are included in the visualization by '
+            'setting all of their metadata values to "This sample has no '
+            'metadata". Note that this flag will only be applied if at least '
+            'one sample is present in both the feature table and the metadata.'
+        ),
+        'filter_missing_features': (
+            'This will suppress the error raised when the feature table '
+            'contains features that are not present as tips in the tree. '
+            'These features will be removed from the visualization if this '
+            'flag is passed. Note that this flag will only be applied if '
+            'at least one feature in the table is also present as a tip in '
+            'the tree.'
+        ),
     },
     name='Visualize and Explore Phylogenies with Empress',
     description=(