"Match" table, tree, and sample metadata, and verify that things seem ok #154

fedarko · 2020-04-08T01:03:56Z

Closes #139. Now, the sort of case described there will trigger the following error:

We will still need to match/validate feature metadata (e.g. taxonomy) when we add support for it to Empress (#130). I haven't done that in this PR because I wasn't sure how we'd prefer to handle feature metadata, but I can add it if you'd like.

Also changed in this PR: I moved the QZAs/etc. needed for the moving pictures example back into the repository (in docs/moving-pictures/), and added a make docs command that'll re-generate the empress-tree.qzv visualization linked from the README. We could make running make docs part of the Travis build in the future if desired, as a de facto "integration test" (of course, this would require us to have QIIME 2 installed on Travis, which would slow down builds by a few minutes).

Thanks! Let me know if you'd like to go over this over Zoom or whatever sometime.

Not sure why these files weren't here before, but this will make rerunning the tutorial easy. Also "make docs" is just a shorthand that saves extra typing when re-visualizing the moving pictures tree. We could integrate this into the travis build in the future if desired (of course this would be predicated on us getting QIIME 2 set up in the travis build, which would add on a few minutes to each build due to Q2 installation taking some time).

So apparently QIIME 2's transformers from biom table -> pd DataFrame produce DFs that are transposed from what biom.Table does -- QIIME 2 uses samples as the indices (rows) and features as the columns, while biom.Table does it the other way around. As you can imagine, this is pretty confusing! This commit should fix this problem from our end, but in the future we should really add logic to prevent having to do table-DF-transposition, since IIRC that can be super slow with massive DFs. (...We really oughta unit-test _plot.)

antgonza

Thanks @fedarko, I think this is important.

A few comments; also perhaps worth adding (fine to open as issues if not part of this PR):

- adding/running flake8 to review the code; it is currently installed but not used to check the code
- adding the coverage badge to the README
- if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?

antgonza · 2020-04-08T12:51:49Z

empress/_plot.py

    # TODO: do not ignore the feature metadata when specified by the user
    if feature_metadata is not None:
        feature_metadata = feature_metadata.to_dataframe()

+    sample_metadata = sample_metadata.to_dataframe()


If a large mapping file is passed, this is going to be super expensive, right?

IMO the main bottleneck with huge sample metadata in Empress will be in having the browser load it into memory, rather than in the python side of things.

That being said, yeah, it would make sense to avoid or delegate the qiime2.Metadata -> pd.DataFrame conversion if possible. I'm hesitant to make changes to this here because this code isn't part of this PR, but happy to open an issue if you'd like.

@antgonza do you know of other ways to handle metadata that would be more memory efficient? This is roughly the same way that we handle metadata in Emperor and other plugins. Like @fedarko says, the real problem is likely going to be with the feature metadata.

Good question! I was thinking of not loading the full sample metadata dataframe into memory, and instead using the qiime2 methods to retrieve only what you need; for example:
- Metadata.get_column
- MetadataColumn.to_series
- etc.

However, not sure if this will help the footprint or not ...

Gotcha, my understanding is that Metadata parses and loads the whole mapping file into memory.

empress/_plot.py

empress/tools.py

fedarko · 2020-04-08T21:40:15Z

Thanks for the review @antgonza!

Addressing your three comments:

> adding/running flake8 to review the code; it is currently installed but not used to check the code

This is already being done -- see the Travis log for this PR as of writing. The make stylecheck directive (all the way down at the bottom of the log) is running flake8 to check the python code, as well as some JS libraries to check that half of the codebase.

> adding the coverage badge to the README

This is a really good idea, and will help a lot with figuring out where the "holes" are in our test suite (e.g. #142). Opened an issue for this in #156.

> if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?

I'm down to add this, but as mentioned this will automatically increase build times by at least a few minutes (installing QIIME 2 takes some time). @ElDeveloper / @kwcantrell what do you think?

ElDeveloper · 2020-04-09T00:00:02Z

> if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?
>
> I'm down to add this, but as mentioned this will automatically increase build times by at least a few minutes (installing QIIME 2 takes some time). @ElDeveloper / @kwcantrell what do you think?

In general I agree, however I would suggest testing the QZV generation via Python. And installing QIIME2 should be fine. We can test the plugin like it's done in q2-diversity:

https://github.com/qiime2/q2-diversity/blob/master/q2_diversity/tests/test_beta_correlation.py

fedarko · 2020-04-09T02:12:39Z

Ok, I've added an Artifact API test for the QZV generation. Currently the test just checks that the QZV was generated without errors, although we can of course add more detailed things like HTML checks in the future (that might be sort of out of scope of this PR, though).

Let me know what you think -- it's exciting to get this code formally tested!

antgonza

Thanks @fedarko! @ElDeveloper, could you take a look and merge if it looks fine to you?

ElDeveloper

Thanks @fedarko and @antgonza, looks good, just a few comments.

ElDeveloper · 2020-04-10T19:20:37Z

tests/python/test_integration.py

ElDeveloper · 2020-04-10T19:31:38Z

empress/tools.py

+            "feature table."
+        )
+    # Report to user about any dropped samples from s. metadata and/or table
+    print_if_dropped(


Any motivation for using print statements instead of warnings.warn? I think a warning here would make more sense, plus testing is fairly straightforward with assertWarnsRegex.

I don't have a particularly good motivation for this -- printing was just how I did this in Qurro :)

It's possible to get warnings.warn() to go to stdout, right? I'd want to make sure that (if the user passes --verbose to Empress) they see these messages show up.

Gotcha. Yes, it is possible to see warnings with the --verbose flag.
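A minimal sketch of the suggestion in this thread, assuming a hypothetical helper `report_dropped_samples()` and a hypothetical `DataMatchingWarning` category (neither name is taken from the Empress code). It emits the message with `warnings.warn` and checks it in a unit test with `assertWarnsRegex`:

```python
import unittest
import warnings


class DataMatchingWarning(Warning):
    """Hypothetical warning category for non-fatal matching problems."""


def report_dropped_samples(dropped_ids):
    """Warn (rather than print) about samples dropped during matching.

    This is an illustrative stand-in, not the Empress implementation.
    """
    if dropped_ids:
        warnings.warn(
            "{} sample(s) were dropped: {}".format(
                len(dropped_ids), ", ".join(sorted(dropped_ids))
            ),
            DataMatchingWarning,
        )


class TestReportDroppedSamples(unittest.TestCase):
    def test_warns_when_samples_dropped(self):
        # assertWarnsRegex checks both the category and the message text.
        with self.assertWarnsRegex(
            DataMatchingWarning, r"2 sample\(s\) were dropped: s1, s2"
        ):
            report_dropped_samples({"s2", "s1"})

    def test_silent_when_nothing_dropped(self):
        # Record every warning raised in the block; none are expected.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            report_dropped_samples(set())
        self.assertEqual(caught, [])
```

Using a dedicated warning category lets a test (or a caller) tell matching warnings apart from unrelated ones raised by other libraries.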

empress/tools.py

ElDeveloper · 2020-04-10T19:38:35Z

empress/tools.py

+
+    # Match table and sample metadata
+    sample_metadata_t = sample_metadata.T
+    sf_ff_table, sf_sample_metadata_t = ff_table.align(


Does this mean that samples in the feature table not present in the metadata are going to be dropped? If so, I would suggest only doing this if the user explicitly asks for this, for example with a flag --ignore-missing-samples or something along those lines.

Yes, this means that the default behavior of Empress right now is "filtering" the table and sample metadata to just the shared samples.

Just to check, would you be ok with the following solution:

By default, if any samples in the table have no metadata (or any samples in the metadata are not in the table), raise an error explaining the situation

If the user passes a --filter-missing-samples flag or something, do the current behavior (filter to samples shared btwn. table and metadata, so long as there's at least 1 such sample)

This would be slightly different from Emperor's --ignore-missing-samples flag, hence the slightly different name to avoid confusing users.

We could also implement an analogue to what Emperor's --ignore-missing-samples flag does, where we allow for extra samples in the metadata to not be in the table by default, but raise errors that need to be manually overridden when the table contains samples not in the metadata. However I'm not sure this would be super useful, because samples without metadata are kind of useless in the Empress visualization IMO? Unlike in Emperor, where those samples are still "represented" in the visualization. The first solution seems more intuitive to me for Empress' utility.

...Sorry for the rambles -- we can talk more about this over a call if you'd prefer :)

Thanks for the explanation @fedarko.

> We could also implement an analogue to what Emperor's --ignore-missing-samples flag does, where we allow for extra samples in the metadata to not be in the table by default, but raise errors that need to be manually overridden when the table contains samples not in the metadata.

The motivation is to always show users which samples are lacking metadata, and to do so clearly. For example, in Emperor, when a sample does not have metadata and the user passes --ignore-missing-samples, these samples are padded with "placeholder metadata" that shows what is missing: all metadata columns for those samples would read "This sample has no metadata". I think it's better to implement something analogous to --ignore-missing-samples.

Got it. I can implement something analogous to what Emperor does -- will take a bit of extra time, but shouldn't be too bad :) Just to double check, this means that samples that are in the metadata but not in the table are dropped by default? (I can set things up to warn the user about these samples but still not do anything.)

Agreed that consistency here would be good, especially between the two Emp[eror|ress] tools :) As a heads up, Qurro does things a slightly different way: it filters its inputs so that all the samples that remain are the "matches" between the table and metadata, outputting warnings/print messages about these operations as needed. (...If I could go back a year and change things, I'd probably make this more similar to how Emperor works. Sorry!)

Yes, in general the metadata and tree can be considered supersets of the feature table, but not the other way around. The flow should be:

    if there are tips in the table that are not present in the tree:
        raise an error and allow to override with "filter missing features" flag
        # when the user sets the "filter missing features" flag, these features
        # are removed from the table.
    if there are samples in the table not present in the metadata:
        raise an error and allow to override with "ignore missing samples" flag.
        # when the user sets the "ignore missing samples" flag, these samples'
        # metadata is padded with a message "This sample has no metadata".

For reference, here's how things are done in Emperor.

Regarding Qurro, there's always a chance for a new release :)

Thanks for the detailed writeup! This makes things clearer. I'll try to get to this soon. Will also see if I can add biocore/qurro#296 in for the next Qurro release ;)
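The flow agreed on in this thread can be sketched roughly as follows. This is a simplified illustration that uses plain dicts and sets rather than the DataFrames the real code works with; `match_inputs_sketch()`, `DataMatchingError`, and the two keyword arguments are hypothetical names chosen for the example, not the actual Empress API:

```python
PLACEHOLDER = "This sample has no metadata"


class DataMatchingError(ValueError):
    """Hypothetical error raised when the inputs cannot be matched."""


def match_inputs_sketch(tree_tips, table, sample_metadata,
                        filter_missing_features=False,
                        ignore_missing_samples=False):
    """Simplified stand-in for the matching flow described above.

    tree_tips:       set of tip names in the tree
    table:           dict of sample ID -> {feature ID: count}
    sample_metadata: dict of sample ID -> {column name: value}
    """
    # Step 1: features in the table must be tips in the tree.
    table_features = set()
    for counts in table.values():
        table_features.update(counts)
    missing_features = table_features - set(tree_tips)
    if missing_features:
        if not filter_missing_features:
            raise DataMatchingError(
                "Features in the table but not in the tree: {}".format(
                    ", ".join(sorted(missing_features))
                )
            )
        # Overridden: remove those features from the table.
        table = {
            sample: {f: n for f, n in counts.items() if f in tree_tips}
            for sample, counts in table.items()
        }

    # Step 2: samples in the table must have metadata.
    missing_samples = set(table) - set(sample_metadata)
    if missing_samples and not ignore_missing_samples:
        raise DataMatchingError(
            "Samples in the table but not in the metadata: {}".format(
                ", ".join(sorted(missing_samples))
            )
        )
    # Overridden (or nothing missing): pad missing samples with placeholder
    # values. Metadata-only samples are simply left out of the result.
    columns = sorted({c for row in sample_metadata.values() for c in row})
    matched = {}
    for sample in table:
        if sample in sample_metadata:
            matched[sample] = dict(sample_metadata[sample])
        else:
            matched[sample] = {c: PLACEHOLDER for c in columns}
    return table, matched
```

Note that the metadata and tree are allowed to be supersets of the table, so extra metadata samples never raise an error in this sketch; only table entries without a counterpart do.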

tests/python/test_integration.py

This entailed substantial restructuring of match_inputs(). I also completely deleted warn_if_dropped(), because it was honestly easier to replace it with custom error messages for each of its 3 usages. (Also, that thing was like 50 lines of docstring / infrastructure for 8 lines of code. It was gnarly. :P) This isn't done yet! I still need to test this new behavior thoroughly, and to update the tests for the old functionality accordingly.

fedarko · 2020-04-18T00:45:39Z

@ElDeveloper Sorry for the wait! I've updated the matching behavior to more closely resemble Emperor's, and I've added decently thorough tests that check the related cases.

Some related screenshots, for fun:

One of the possible errors

Some possible warnings (shown when you use the `--verbose` flag)

Placeholder metadata in practice

These changes should take care of things -- at least until we add in feature metadata support to Empress. (And feature metadata is an optional argument to match_inputs(), so we shouldn't have to modify these tests at all when we add in feature metadata matching in the future. (Knock on 🌲, though.))

Thanks!

ElDeveloper

Just one quick comment!

README.md

empress/tools.py

ElDeveloper · 2020-04-20T18:44:14Z

Thanks so much @fedarko!

fedarko added 11 commits April 7, 2020 10:47

BUG/TST: Add back in data matching/checking code

1619a80

Closes biocore#139, for real this time. Eventually we'll need to check that feature metadata matches up, but that is its own problem for later down the road.

STY: fix flake8 complaint

ce6942f

DOC: typo fix [ci skip]

130f304

STY: rm extra blank line

e78eca9

TST: rename a prev matching test and add skeletons

a526e9d

TST: Add "no features shared" test for matching

571ef4a

part of biocore#139 fixes

TST: test a warning msg printed during matching

20f01c1

TST: Add sample dropping warning test

6f33e48

think this pr should be good for now

DOC: add note to match_inputs() re biocore#130 (TODO)

8a53b44

antgonza reviewed Apr 8, 2020

fedarko mentioned this pull request Apr 8, 2020

Make Empress.Tree subclass bp.Tree rather than skbio.TreeNode #155

Closed

fedarko added 3 commits April 8, 2020 17:47

TST: Install and use QIIME 2 env in travis build

e3aad19

TST: Add actual Q2 integration test!

a6f4284

Addresses @ElDeveloper's comment on biocore#154. I'm keeping 'make docs' around since it could still be nifty (if you just wanna regenerate the empress-tree.qzv file without rerunning the tests, I guess).

TST: don't run 'make docs' on travis build

b711350

Since the Q2 Artifact API test I just added does the same thing.

antgonza approved these changes Apr 9, 2020

ElDeveloper requested changes Apr 10, 2020

fedarko and others added 6 commits April 10, 2020 12:50

TST: Add rough Q2 visualization check biocore#154

8480b52

Addresses comment from @ElDeveloper

STY: Remove blank lines in match_inputs()

a1a420b

Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>

STY: more blank line removals in docstring

b10259a

Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>

STY: rm blank lines in print_if_dropped docstring

b41b34c

Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>

MNT: warn instead of printing re: sample dropping

9971a0e

Tests haven't been updated yet -- will do so when --ignore-missing-samples option added in. (So this will currently break the tests.) This represents part of the work on addressing @ElDeveloper's comments on biocore#154.

Merge branch 'matching-fix' of https://github.com/fedarko/empress int…

4a4e79d

…o matching-fix

fedarko mentioned this pull request Apr 10, 2020

Adjust matching to be more consistent to Emperor/Empress biocore/qurro#296

Open

fedarko added 2 commits April 13, 2020 14:59

ENH: add UI skeleton for no-data sample/feat flags

e1d640c

Per suggestion from @ElDeveloper in biocore#154

STY: make _plot inputs prettier

a5167f0

fedarko added 15 commits April 13, 2020 15:06

DOC: add ref to emperor --ignore-missing-samples

355c3ed

DOC: Remove 'standalone' instructions in README

01ff29b

Just for now. When we resolve biocore#140, we should add these instructions back in (likely we'll also have to adjust these when we get to the 'initial release' of Empress on PyPI / conda-forge / etc.)

DOC: switch feature/sample flag order, imprv docs

ed9c177

MNT: Avoid redundant table DF transpositions biocore#155

427dbb4

BUG: don't display useless warning in most cases

876703e

TST: reduce tests to just one working one

2053984

will add more back (with relevant changes to work with new behavior) soon

DOC: add TODO note re empty checking

2ea4052

TST: add back "simple" matching error tests

9fc25f9

TST: add + beef up tests of matching warnings, etc

1ec7604

TST: add --p-ignore-missing-samples tests

500b589

TST: add another cornercase test

2741ba5

TST: test final "warning" in matching func for now

cfa95ce

also fixed a bug in prev test i just added in, and removed extraneous comment

TST: Add other check for extra s.m. sample warning

1fc4e2e

I think I'm satisfied with the new matching behavior tests, at least for now

DOC: update example QZV :)

b6fd9a0

ElDeveloper approved these changes Apr 20, 2020

MNT: don't warn on dropped samples from s.metadata

b409d59

See new comment for justification. Addresses comment from @ElDeveloper.

ElDeveloper merged commit 2bd92b3 into biocore:master Apr 20, 2020

fedarko mentioned this pull request Apr 20, 2020

Supporting running Empress outside of QIIME 2? #140

Closed
