"Match" table, tree, and sample metadata, and verify that things seem ok #154
Conversation
Closes biocore#139, for real this time. Eventually we'll need to check that feature metadata matches up, but that is its own problem for later down the road.
Not sure why these files weren't here before, but this will make rerunning the tutorial easy. Also "make docs" is just a shorthand that saves extra typing when re-visualizing the moving pictures tree. We could integrate this into the travis build in the future if desired (of course this would be predicated on us getting QIIME 2 set up in the travis build, which would add on a few minutes to each build due to Q2 installation taking some time).
So apparently QIIME 2's transformers from biom table -> pd DataFrame produce DFs that are transposed from what biom.Table does -- QIIME 2 uses samples as the indices (rows) and features as the columns, while biom.Table does it the other way around. As you can imagine, this is pretty confusing! This commit should fix this problem from our end, but in the future we should really add logic to prevent having to do table-DF-transposition, since IIRC that can be super slow with massive DFs. (...We really oughta unit-test _plot.)
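To make the orientation mismatch concrete, here's a toy sketch (hypothetical sample/feature names, plain pandas rather than biom) showing that transposing a features-as-rows DataFrame yields the samples-as-rows layout QIIME 2's transformers produce:

```python
import pandas as pd

# biom.Table-style orientation: features as rows (index), samples as columns.
biom_style = pd.DataFrame(
    {"S1": [1, 2], "S2": [3, 4]},
    index=["FeatureA", "FeatureB"],
)

# QIIME 2's biom.Table -> pd.DataFrame transformers use the opposite
# orientation: samples as rows, features as columns. A transpose converts
# one convention to the other.
qiime2_style = biom_style.T

print(qiime2_style.index.tolist())    # samples: ['S1', 'S2']
print(qiime2_style.columns.tolist())  # features: ['FeatureA', 'FeatureB']
```

(As noted above, `.T` copies the underlying data, which is why avoiding the transposition entirely would be nicer for huge tables.)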
think this pr should be good for now
Thanks @fedarko, I think this is important.
A few comments; also perhaps worth adding (fine to open as issues if not part of this PR):
- adding/running flake8 to review the code; it is currently being installed but not being used
- adding the coverage badge to the README
- if you have the qza files in the repo, perhaps worth adding a call to generate the empress.qzv at the end of the build?
```python
# TODO: do not ignore the feature metadata when specified by the user
if feature_metadata is not None:
    feature_metadata = feature_metadata.to_dataframe()

sample_metadata = sample_metadata.to_dataframe()
```
If a large mapping file is passed, this is going to be super expensive, right?
IMO the main bottleneck with huge sample metadata in Empress will be in having the browser load it into memory, rather than in the python side of things.
That being said, yeah, it would make sense to avoid or delegate the `qiime2.Metadata` -> `pd.DataFrame` conversion if possible. I'm hesitant to make changes to this here because this code isn't part of this PR, but happy to open an issue if you'd like.
Good question! I was thinking of not loading the full sample metadata DataFrame into memory, but instead using the qiime2 methods to retrieve what you need; for example: `Metadata.get_column`, `MetadataColumn.to_series`, etc. However, not sure if this will help the footprint or not ...
Gotcha, my understanding is that `Metadata` parses and loads the whole mapping file into memory.
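For what it's worth, the column-at-a-time idea can be sketched with plain pandas (a made-up in-memory mapping file stands in for a real TSV here; whether this actually helps the footprint depends on how `qiime2.Metadata` parses the file):

```python
import io
import pandas as pd

# Toy stand-in for a large mapping file (normally read from disk).
mapping_tsv = io.StringIO(
    "sample-id\tbody-site\tdepth\n"
    "S1\tgut\t10\n"
    "S2\ttongue\t20\n"
)

# Instead of materializing every column, load only the one we need --
# roughly the spirit of Metadata.get_column / MetadataColumn.to_series.
body_site = pd.read_csv(
    mapping_tsv,
    sep="\t",
    index_col="sample-id",
    usecols=["sample-id", "body-site"],
).squeeze("columns")

print(body_site.loc["S1"])  # gut
```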
Thanks for the review @antgonza! Addressing your three comments:
This is already being done -- see the Travis log for this PR as of writing.
This is a really good idea, and will help a lot with figuring out where the "holes" are in our test suite (e.g. #142). Opened an issue for this in #156.
I'm down to add this, but as mentioned this will increase build times by at least a few minutes (installing QIIME 2 takes some time). @ElDeveloper / @kwcantrell what do you think?
In general I agree, however I would suggest testing the QZV generation via Python. And installing QIIME 2 should be fine. We can test the plugin like it's done in q2-diversity:
Addresses @ElDeveloper's comment on biocore#154. I'm keeping 'make docs' around since it could still be nifty (if you just wanna regenerate the empress-tree.qzv file without rerunning the tests, I guess).
Since the Q2 Artifact API test I just added does the same thing.
Ok, I've added an Artifact API test for the QZV generation. Currently the test just checks that the QZV was generated without errors, although we can of course add more detailed things like HTML checks in the future (that might be sort of out of scope of this PR, though). Let me know what you think -- it's exciting to get this code formally tested!
Thanks @fedarko! @ElDeveloper, could you take a look and merge if it looks fine to you?
empress/tools.py (outdated):

```python
        "feature table."
    )
# Report to user about any dropped samples from s. metadata and/or table
print_if_dropped(
```
Any motivation for using print statements instead of using `warnings.warn`? I think a warning here would make more sense, plus testing is fairly straightforward with `assertWarnsRegex`.
I don't have a particularly good motivation for this -- printing was just how I did this in Qurro :)
It's possible to get `warnings.warn()` to go to stdout, right? I'd want to make sure that (if the user passes `--verbose` to Empress) they see these messages show up.
Gotcha. Yes, it is possible to see warnings with the `--verbose` flag.
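As a sketch of the suggested print-to-warning switch (`check_samples` and its message text are hypothetical, not Empress's actual code), warnings raised this way are easy to capture in tests:

```python
import warnings

def check_samples(table_samples, metadata_samples):
    """Warn (rather than print) about samples dropped during matching."""
    dropped = set(table_samples) - set(metadata_samples)
    if dropped:
        warnings.warn(
            f"{len(dropped)} sample(s) in the table were not in the "
            "metadata and will be dropped.",
            UserWarning,
        )
    return sorted(set(table_samples) & set(metadata_samples))

# In a unittest.TestCase this would be assertWarnsRegex(...); here we just
# capture the warning directly:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shared = check_samples(["S1", "S2", "S3"], ["S1", "S2"])

print(shared)       # ['S1', 'S2']
print(len(caught))  # 1
```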
empress/tools.py (outdated):

```python
# Match table and sample metadata
sample_metadata_t = sample_metadata.T
sf_ff_table, sf_sample_metadata_t = ff_table.align(
```
Does this mean that samples in the feature table not present in the metadata are going to be dropped? If so, I would suggest only doing this if the user explicitly asks for this, for example with a flag like `--ignore-missing-samples` or something along those lines.
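For context, here's roughly what the `align` call under discussion does in isolation (toy data, not Empress's actual code): with `join="inner"` along `axis=1`, samples absent from the metadata are silently dropped.

```python
import pandas as pd

# Hypothetical feature table (features x samples) and transposed sample
# metadata (samples as columns), mirroring the ff_table.align(...) call.
ff_table = pd.DataFrame(
    {"S1": [1, 2], "S2": [3, 4], "S3": [5, 6]},
    index=["FeatureA", "FeatureB"],
)
sample_metadata_t = pd.DataFrame(
    {"S1": ["gut"], "S2": ["tongue"]},
    index=["body-site"],
)

# Keep only the sample columns shared by both -- S3, which has no
# metadata, disappears without any error or flag.
sf_ff_table, sf_sample_metadata_t = ff_table.align(
    sample_metadata_t, join="inner", axis=1
)
print(sf_ff_table.columns.tolist())  # ['S1', 'S2']
```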
Yes, this means that the default behavior of Empress right now is "filtering" the table and sample metadata to just the shared samples.
Just to check, would you be ok with the following solution:
- By default, if any samples in the table have no metadata (or any samples in the metadata are not in the table), raise an error explaining the situation
- If the user passes a `--filter-missing-samples` flag or something, do the current behavior (filter to samples shared btwn. table and metadata, so long as there's at least 1 such sample)

This would be slightly different from Emperor's `--ignore-missing-samples` flag, hence the slightly different name to avoid confusing users.

We could also implement an analogue to what Emperor's `--ignore-missing-samples` flag does, where we allow for extra samples in the metadata to not be in the table by default, but raise errors that need to be manually overridden when the table contains samples not in the metadata. However, I'm not sure this would be super useful, because samples without metadata are kind of useless in the Empress visualization IMO? Unlike in Emperor, where those samples are still "represented" in the visualization. The first solution seems more intuitive to me for Empress' utility.
...Sorry for the rambles -- we can talk more about this over a call if you'd prefer :)
Thanks for the explanation @fedarko.
We could also implement an analogue to what Emperor's --ignore-missing-samples flag does, where we allow for extra samples in the metadata to not be in the table by default, but raise errors that need to be manually overridden when the table contains samples not in the metadata.
The motivation is to always show the users what samples are lacking metadata, and do so clearly. For example, in Emperor, when a sample does not have metadata and the user selects `--ignore-missing-samples`, these samples are padded with "placeholder metadata" that shows what is missing: all columns in the metadata for those samples would have "This sample has no metadata". I think it's better to implement something analogous to `--ignore-missing-samples`.
Got it. I can implement something analogous to what Emperor does -- will take a bit of extra time, but shouldn't be too bad :) Just to double check, this means that samples that are in the metadata but not in the table are dropped by default? (I can set things up to warn the user about these samples but still not do anything.)
Agreed that consistency here would be good, especially between the two Emp[eror|ress] tools :) As a heads up, Qurro does things a slightly different way: it filters its inputs so that all the samples that remain are the "matches" between the table and metadata, outputting warnings/print messages about these operations as needed. (...If I could go back a year and change things, I'd probably make this more similar to how Emperor works. Sorry!)
Yes, in general the metadata and tree can be considered supersets of the feature table, but not the other way around. The flow should be:

if there are tips in the table that are not present in the tree:
    raise an error, and allow overriding with a "filter missing features" flag
    # when the user sets the "filter missing features" flag, these features
    # are removed from the table
if there are samples in the table that are not present in the metadata:
    raise an error, and allow overriding with an "ignore missing samples" flag
    # when the user sets the "ignore missing samples" flag, these samples'
    # metadata is padded with the message "This sample has no metadata"
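The samples half of that flow could be sketched like this (a hypothetical `match_inputs` signature and toy data, just to make the error/padding behavior concrete -- not Empress's actual implementation):

```python
import pandas as pd

PAD = "This sample has no metadata"

def match_inputs(table, metadata, ignore_missing_samples=False):
    """Match a feature table (samples as columns) against sample metadata."""
    missing = set(table.columns) - set(metadata.index)
    if missing:
        if not ignore_missing_samples:
            raise ValueError(
                f"{len(missing)} table sample(s) are missing from the metadata."
            )
        # Pad missing samples with placeholder metadata, as in Emperor.
        padding = pd.DataFrame(PAD, index=sorted(missing), columns=metadata.columns)
        metadata = pd.concat([metadata, padding])
    # The metadata can be a superset of the table; keep only table samples.
    return table, metadata.loc[list(table.columns)]

table = pd.DataFrame({"S1": [1], "S2": [2]}, index=["FeatureA"])
metadata = pd.DataFrame({"body-site": ["gut"]}, index=["S1"])

# Default: raises ValueError, since S2 has no metadata. With the flag:
_, matched = match_inputs(table, metadata, ignore_missing_samples=True)
print(matched.loc["S2", "body-site"])  # This sample has no metadata
```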
For reference, here's how things are done in Emperor.
Regarding Qurro, there's always a chance for a new release :)
Thanks for the detailed writeup! This makes things clearer. I'll try to get to this soon. Will also see if I can add biocore/qurro#296 in for the next Qurro release ;)
Addresses comment from @ElDeveloper
Co-Authored-By: Yoshiki Vázquez Baeza <[email protected]>
Tests haven't been updated yet -- will do so when --ignore-missing-samples option added in. (So this will currently break the tests.) This represents part of the work on addressing @ElDeveloper's comments on biocore#154.
Just for now. When we resolve biocore#140, we should add these instructions back in (likely we'll also have to adjust these when we get to the 'initial release' of Empress on PyPI / conda-forge / etc.)
This entailed substantial restructuring of match_inputs(). I also completely deleted warn_if_dropped(), because it was honestly easier to replace it with custom error messages for each of its 3 usages. (Also, that thing was like 50 lines of docstring / infrastructure for 8 lines of code. It was gnarly. :P) This isn't done yet! I still need to test this new behavior thoroughly, and to update the tests for the old functionality accordingly.
will add more back (with relevant changes to work with new behavior) soon
also fixed a bug in prev test i just added in, and removed extraneous comment
I think I'm satisfied with the new matching behavior tests, at least for now
@ElDeveloper Sorry for the wait! I've updated the matching behavior to more closely resemble Emperor's, and I've added decently thorough tests that check the related cases. Some related screenshots, for fun: one of the possible errors, and some of the possible warnings (shown when you use the relevant flags).
Just one quick comment!
See new comment for justification. Addresses comment from @ElDeveloper.
Thanks so much @fedarko!
Closes #139. Now, the sort of case described there will trigger the following error:
We will still need to match/validate feature metadata (e.g. taxonomy) when we add support for that in to Empress (#130) -- I haven't done that in this PR, but I can add it in if you'd like. (Wasn't sure how we'd prefer to handle feature metadata, so I haven't included this here.)
Also changed in this PR: I moved the QZAs/etc. needed for the moving pictures example back into the repository (in `docs/moving-pictures/`), and added a `make docs` command that'll re-generate the `empress-tree.qzv` visualization linked from the README. We could make running `make docs` part of the Travis build in the future if desired, as a de facto "integration test" (of course, this would require us to have QIIME 2 installed on Travis, which would slow down builds by a few minutes).

Thanks! Let me know if you'd like to go over this over Zoom or whatever sometime.