Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plot_pairwise_average_fst may change the order of the cohorts when using a cohort dict #540

Closed

Conversation

jonbrenas
Copy link
Collaborator

Resolves #539. There might be other functions affected in the same way.

@jonbrenas jonbrenas requested a review from alimanfoo May 24, 2024 09:40
@@ -493,7 +493,9 @@ def plot_pairwise_average_fst(
**kwargs,
):
# setup df
cohort_list = np.unique(fst_df[["cohort1", "cohort2"]].values)
cohort_list = pd.unique(
[c for cl in fst_df[["cohort1", "cohort2"]].values for c in cl]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The for loops in this line look a little funky. What is it supposed to be doing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to get the list of all cohorts in the same order they were defined in the cohort dictionary. plot_pairwise_average_fst does not have the list of cohorts or the dictionary as an input so we have to access it in another way. fst_df[["cohort1", "cohort2"]] is a list of list each one containing two cohort names. The last cohort of the dictionary is missing from the cohort1 column and the first is missing from cohort2, hence why they are both accessed. np.unique is able to take a list of list and return every unique value but it sorts them alphabetically breaking the dictionary's order. pd.unique keeps the order but can't deal with a list of lists. The for loops are thus used to go from a list of list of cohorts to a list of cohorts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonbrenas. What about:

pd.unique(fst_df[["cohort1", "cohort2"]].values.flatten())

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested list comprehensions always bake my noodle!

I think it's equivalent to this:

cohort_pairs = fst_df[["cohort1", "cohort2"]].values
flattened_cohorts = []
for cohort_pair in cohort_pairs:
	for cohort in cohort_pair:
		flattened_cohorts.append(cohort)

Or like Alistair suggested:

list(fst_df[["cohort1", "cohort2"]].values.flatten())

@alimanfoo
Copy link
Member

Hi @jonbrenas, just to mention I've updated this branch to bring in some test fixes from master.

Copy link

codecov bot commented May 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.89%. Comparing base (dc89c9c) to head (3622b97).
Report is 353 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
- Coverage   98.61%   95.89%   -2.73%     
==========================================
  Files          38       39       +1     
  Lines        3690     3821     +131     
==========================================
+ Hits         3639     3664      +25     
- Misses         51      157     +106     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ahernank
Copy link
Collaborator

@jonbrenas I've just re-ran the tests here to check they are all good now. It is only the coverage that needs a bit of additional work.

@jonbrenas
Copy link
Collaborator Author

Thanks @ahernank. I am not quite sure what the problem with codecov/project is, though. It looks like there is an indirect change that leaves a line uncovered in dipclust.py but I can't really figure out why that would be.

Copy link
Collaborator

@leehart leehart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we try using flatten() instead of nested list comprehension, just for readability?

@jonbrenas
Copy link
Collaborator Author

jonbrenas commented Jun 18, 2024

Thanks, @leehart. The bottom-left half of the diagram wasn't being filled anymore. I also changed the list comprehension to .flatten(). (I personally find the list comprehension more readable but I see I am in the minority ;) )

@jonbrenas jonbrenas requested a review from leehart June 18, 2024 11:34
@leehart
Copy link
Collaborator

leehart commented Jun 18, 2024

Cool, thanks @jonbrenas !

I see there's suddenly a code coverage failure in CI, which I don't quite understand, which I think we've experienced before. I wonder if it's the product of something random. I can't seem to re-run it with any success. I think in the past I've just had to add more coverage to an unrelated part of the code to satisfy the threshold, but it's annoying.

@jonbrenas
Copy link
Collaborator Author

I think the problem is with dipcluster.py. It might be that the test doesn't cover some of the new functions.

@leehart
Copy link
Collaborator

leehart commented Jun 18, 2024

Yes, I see that anoph/dipclust.py is the file with the biggest "Change %", i.e. -49.40%, with anoph/snp_frq.py having -4.79%.

I'm not sure what these figures mean!?

From https://docs.codecov.com/docs/coverage-percentages

Head coverage percent is the line coverage of all your files in your head commit.
Patch coverage percent is the line coverage of all the lines changed in your commit.
Change percent is the percent that the overall project line coverage has changed from base to head.

I guess if we can somehow claw back -2.72% test coverage, it should pass.

@alimanfoo
Copy link
Member

Hi folks, I've folded this fix into some other maintenance I was doing on the plot_pairwise_fst_function() over in #579. I'll close here and merge in #579 if approved.

@alimanfoo alimanfoo closed this Aug 9, 2024
@alimanfoo alimanfoo added the BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plot_pairwise_average_fst may change the order of the cohorts when using a cohort dict
4 participants