Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Match" table, tree, and sample metadata, and verify that things seem ok #154

Merged
merged 38 commits into from
Apr 20, 2020

Conversation

fedarko
Copy link
Collaborator

@fedarko fedarko commented Apr 8, 2020

Closes #139. Now, the sort of case described there will trigger the following error:

image

We will still need to match/validate feature metadata (e.g. taxonomy) when we add support for that in to Empress (#130) -- I haven't done that in this PR, but I can add it in if you'd like. (Wasn't sure how we'd prefer to handle feature metadata, so I haven't included this here.)

Also changed in this PR: I moved the QZAs/etc. needed for the moving pictures example back into the repository (in docs/moving-pictures/), and added a make docs command that'll re-generate the empress-tree.qzv visualization linked from the README. We could make running make docs part of the Travis build in the future if desired, as a de facto "integration test" (of course, this would require us to have QIIME 2 installed on Travis, which would slow down builds by a few minutes).

Thanks! Let me know if you'd like to go over this over Zoom or whatever sometime.

fedarko added 11 commits April 7, 2020 10:47
Closes biocore#139, for real this time.

Eventually we'll need to check that feature metadata matches up, but
that is its own problem for later down the road.
Not sure why these files weren't here before, but this will make
rerunning the tutorial easy.

Also "make docs" is just a shorthand that saves extra typing when
re-visualizing the moving pictures tree. We could integrate this into
the travis build in the future if desired (of course this would be
predicated on us getting QIIME 2 set up in the travis build, which
would add on a few minutes to each build due to Q2 installation taking
some time).
So apparently QIIME 2's transformers from biom table -> pd DataFrame
produce DFs that are transposed from what biom.Table does -- QIIME 2
uses samples as the indices (rows) and features as the columns,
while biom.Table does it the other way around.

As you can imagine, this is pretty confusing! This commit should fix
this problem from our end, but in the future we should really add
logic to prevent having to do table-DF-transposition, since IIRC that
can be super slow with massive DFs.

(...We really oughta unit-test _plot.)
think this pr should be good for now
Copy link
Collaborator

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fedarko, I think this is important.

A few comments; also perhaps worth adding (fine to open as issues if not part of this PR):

  • adding/running flake8 to the review the code, currently is being installed but it not being used to review the code
  • adding the coverage batch in the README
  • if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?

# TODO: do not ignore the feature metadata when specified by the user
if feature_metadata is not None:
feature_metadata = feature_metadata.to_dataframe()

sample_metadata = sample_metadata.to_dataframe()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a large mapping file is passed, this is going to be super expensive, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the main bottleneck with huge sample metadata in Empress will be in having the browser load it into memory, rather than in the python side of things.

That being said, yeah, it would make sense to avoid or delegate the qiime2.Metadata -> pd.DataFrame conversion if possible. I'm hesitant to make changes to this here because this code isn't part of this PR, but happy to open an issue if you'd like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antgonza do you know of other ways to handle metadata that would be more memory efficient? This is roughly the same way that we handle metadata in Emperor and other plugins. Like @fedarko says, the real problem is likely going to be with the feature metadata.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I was thinking in not loading the full sample metadata dataframe into memory but use the qiime2 methods to retrieve what you need; for example:
Metadata.get_column
MetadataColumn.to_series
etc

However, not sure if this will help the footprint or not ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, my understanding is that Metadata parses and loads the whole mapping file into memory.

empress/_plot.py Show resolved Hide resolved
empress/_plot.py Show resolved Hide resolved
empress/tools.py Show resolved Hide resolved
@fedarko
Copy link
Collaborator Author

fedarko commented Apr 8, 2020

Thanks for the review @antgonza!

Addressing your three comments:

adding/running flake8 to the review the code, currently is being installed but it not being used to review the code

This is already being done -- see the Travis log for this PR as of writing. The make stylecheck directive (all the way down at the bottom of the log) is running flake8 to check the python code, as well as some JS libraries to check that half of the codebase.

adding the coverage batch in the README

This is a really good idea, and will help a lot with figuring out where the "holes" are in our test suite (e.g. #142). Opened an issue for this in #156.

if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?

I'm down to add this, but as mentioned this will automatically increase build times by at least a few minutes (installing QIIME 2 takes some time). @ElDeveloper / @kwcantrell what do you think?

@ElDeveloper
Copy link
Member

if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?
I'm down to add this, but as mentioned this will automatically increase build times by at least a few minutes (installing QIIME 2 takes some time). @ElDeveloper / @kwcantrell what do you think?

In general I agree, however I would suggest testing the QZV generation via Python. And installing QIIME2 should be fine. We can test the plugin like it's done in q2-diversity:

https://github.com/qiime2/q2-diversity/blob/master/q2_diversity/tests/test_beta_correlation.py

fedarko added 3 commits April 8, 2020 17:47
Addresses @ElDeveloper's comment on biocore#154.

I'm keeping 'make docs' around since it could still be nifty (if you
just wanna regenerate the empress-tree.qzv file without rerunning
the tests, I guess).
Since the Q2 Artifact API test I just added does the same thing.
@fedarko
Copy link
Collaborator Author

fedarko commented Apr 9, 2020

Ok, I've added an Artifact API test for the QZV generation. Currently the test just checks that the QZV was generated without errors, although we can of course add more detailed things like HTML checks in the future (that might be sort of out of scope of this PR, though).

Let me know what you think -- it's exciting to get this code formally tested!

Copy link
Collaborator

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fedarko! @ElDeveloper, could you take a look and merge if it looks fine to you?

Copy link
Member

@ElDeveloper ElDeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fedarko and @antgonza, looks good just a few comments.

# TODO: do not ignore the feature metadata when specified by the user
if feature_metadata is not None:
feature_metadata = feature_metadata.to_dataframe()

sample_metadata = sample_metadata.to_dataframe()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antgonza do you know of other ways to handle metadata that would be more memory efficient? This is roughly the same way that we handle metadata in Emperor and other plugins. Like @fedarko says, the real problem is likely going to be with the feature metadata.

tests/python/test_integration.py Outdated Show resolved Hide resolved
empress/tools.py Outdated
"feature table."
)
# Report to user about any dropped samples from s. metadata and/or table
print_if_dropped(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any motivation for using print statements instead of using warnings.warn? I think a warning here would make more sense, plus testing is fairly straight-forward with assertWarnsRegex.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a particularly good motivation for this -- printing was just how I did this in Qurro :)

It's possible to get warnings.warn() to go to stdout, right? I'd want to make sure that (if the user passes --verbose to Empress) they see these messages show up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Yes, it is possible to see warnings with the --verbose flag.

empress/tools.py Outdated Show resolved Hide resolved
empress/tools.py Outdated Show resolved Hide resolved
empress/tools.py Outdated Show resolved Hide resolved
empress/tools.py Outdated

# Match table and sample metadata
sample_metadata_t = sample_metadata.T
sf_ff_table, sf_sample_metadata_t = ff_table.align(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that samples in the feature table not present in the metadata are going to be dropped? If so, I would suggest only doing this if the user explicitly asks for this, for example with a flag --ignore-missing-samples or something along those lines.

Copy link
Collaborator Author

@fedarko fedarko Apr 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this means that the default behavior of Empress right now is "filtering" the table and sample metadata to just the shared samples.

Just to check, would you be ok with the following solution:

  • By default, if any samples in the table have no metadata (or any samples in the metadata are not in the table), raise an error explaining the situation
  • If the user passes a --filter-missing-samples flag or something, do the current behavior (filter to samples shared btwn. table and metadata, so long as there's at least 1 such sample)

This would be slightly different from Emperor's --ignore-missing-samples flag, hence the slightly different name to avoid confusing users.

We could also implement an analogue to what Emperor's --ignore-missing-samples flag does, where we allow for extra samples in the metadata to not be in the table by default, but raise errors that need to be manually overridden when the table contains samples not in the metadata. However I'm not sure this would be super useful, because samples without metadata are kind of useless in the Empress visualization IMO? Unlike in Emperor, where those samples are still "represented" in the visualization. The first solution seems more intuitive to me for Empress' utility.

...Sorry for the rambles -- we can talk more about this over a call if you'd prefer :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation @fedarko.

We could also implement an analogue to what Emperor's --ignore-missing-samples flag does, where we allow for extra samples in the metadata to not be in the table by default, but raise errors that need to be manually overridden when the table contains samples not in the metadata.

The motivation is to always show the users what samples are lacking metadata, and do so clearly. For example in Emperor when a sample does not have metadata and the user selectes to --ignore-missing-samples these samples are padded with "placeholder metadata" that shows what is missing. For example all columns in the metadata for those samples would have "This sample has no metadata". I think it's better to implement something analogous to --ignore-missing-samples.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I can implement something analogous to what Emperor does -- will take a bit of extra time, but shouldn't be too bad :) Just to double check, this means that samples that are in the metadata but not in the table are dropped by default? (I can set things up to warn the user about these samples but still not do anything.)

Agreed that consistency here would be good, especially between the two Emp[eror|ress] tools :) As a heads up, Qurro does things a slightly different way: it filters its inputs so that all the samples that remain are the "matches" between the table and metadata, outputting warnings/print messages about these operations as needed. (...If I could go back a year and change things, I'd probably make this more similar to how Emperor works. Sorry!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in general the metadata and tree can be considered a supersets of the feature table, but not the other way around. The flow should be:

if there's tips in the table that are not present in the tree:
    raise an error and allow to override with "filter missing features" flag
    # when the user sets the "filter missing features" flag, these features
    # are removed from the table.
if there's samples in table not present in the metadata:
    raise an error and allow to override with "ignore missing samples" flag.
    # when the user sets the "ignore missing samples" flag these sample's
    # metadata is padded with a message "This sample has no metadata".

For reference, here's how things are done in Emperor.

Regarding Qurro, there's always a chance for a new release :) :shipit:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed writeup! This makes things clearer. I'll try to get to this soon. Will also see if I can add biocore/qurro#296 in for the next Qurro release ;)

tests/python/test_integration.py Show resolved Hide resolved
fedarko and others added 6 commits April 10, 2020 12:50
Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>
Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>
Tests haven't been updated yet -- will do so when
--ignore-missing-samples option added in. (So this will currently
break the tests.)

This represents part of the work on addressing @ElDeveloper's comments
on biocore#154.
fedarko added 15 commits April 13, 2020 15:06
Just for now. When we resolve biocore#140, we should add these instructions
back in (likely we'll also have to adjust these when we get to the
'initial release' of Empress on PyPI / conda-forge / etc.)
This entailed substantial restructuring of match_inputs(). I also
completely deleted warn_if_dropped(), because it was honestly easier
to replace it with custom error messages for each of its 3 usages.
(Also, that thing was like 50 lines of docstring / infrastructure for
8 lines of code. It was gnarly. :P)

This isn't done yet! I still need to test this new behavior
thoroughly, and to update the tests for the old functionality
accordingly.
will add more back (with relevant changes to work with new behavior)
soon
also fixed a bug in prev test i just added in, and removed extraneous
comment
I think I'm satisfied with the new matching behavior tests, at
least for now
@fedarko
Copy link
Collaborator Author

fedarko commented Apr 18, 2020

@ElDeveloper Sorry for the wait! I've updated the matching behavior to more closely resemble Emperor's, and I've added decently thorough tests that check the related cases.

Some related screenshots, for fun:

One of the possible errors

image

Some possible warnings (shown when you use the --verbose flag)

image

Placeholder metadata in practice

image

These changes should take care of things -- at least until we add in feature metadata support to Empress. (And feature metadata is an optional argument to match_inputs(), so we shouldn't have to modify these tests at all when we add in feature metadata matching in the future. (Knock on 🌲, though.))

Thanks!

Copy link
Member

@ElDeveloper ElDeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one quick comment!

README.md Show resolved Hide resolved
empress/tools.py Outdated Show resolved Hide resolved
See new comment for justification. Addresses comment from
@ElDeveloper.
@ElDeveloper ElDeveloper merged commit 2bd92b3 into biocore:master Apr 20, 2020
@ElDeveloper
Copy link
Member

Thanks so much @fedarko!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empress' JS gets stuck in the loading animation when none of the samples in the table have any metadata
3 participants