Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EntitiesList update #163

Merged
merged 23 commits into from
Mar 25, 2022
Merged

EntitiesList update #163

merged 23 commits into from
Mar 25, 2022

Conversation

stschiff
Copy link
Member

@stschiff stschiff commented Mar 8, 2022

Adresses #161 and implements some stuff needed for poseidon-analysis-hs.

@codecov-commenter
Copy link

codecov-commenter commented Mar 9, 2022

Codecov Report

Merging #163 (bd57928) into master (796a98e) will increase coverage by 0.14%.
The diff coverage is 86.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #163      +/-   ##
==========================================
+ Coverage   70.02%   70.17%   +0.14%     
==========================================
  Files          18       18              
  Lines        2045     2025      -20     
  Branches      144      147       +3     
==========================================
- Hits         1432     1421      -11     
+ Misses        469      457      -12     
- Partials      144      147       +3     
Impacted Files Coverage Δ
src/Poseidon/Janno.hs 73.25% <0.00%> (ø)
src/Poseidon/CLI/Fetch.hs 62.59% <70.96%> (-0.61%) ⬇️
src/Poseidon/CLI/Forge.hs 69.23% <86.95%> (-8.94%) ⬇️
src/Poseidon/EntitiesList.hs 94.11% <96.22%> (+13.16%) ⬆️
src/Poseidon/GenotypeData.hs 59.88% <100.00%> (+0.24%) ⬆️
src/Poseidon/Package.hs 78.74% <100.00%> (+0.23%) ⬆️
src/Poseidon/SecondaryTypes.hs 30.64% <0.00%> (+9.67%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 796a98e...bd57928. Read the comment docs.

@stschiff stschiff marked this pull request as ready for review March 10, 2022 08:09
@stschiff stschiff requested a review from nevrome March 10, 2022 08:09
@stschiff
Copy link
Member Author

stschiff commented Mar 10, 2022

Alright, this has been a more substantial PR than I anticipated. I significantly changed the EntitiesList API, which is needed to harmonise usage with xerxes and reduce code duplications. There is now a signed and an unsigned version of Entities, and the implementation uses a class to use polymorphic functions that work with both signed and unsigned lists. Signed Entities are important for trident forge and xerxes fstats and xerxes ras, but unsigned entities are important for trident fetch.

Implementation-wise, I should add that I make quite heavy use of the IndividualInfo data structure, which is useful because it stores exactly the three levels of entities per individual that are important for querying: Individual names, group names and package names.

There is now a new feature in trident fetch, that you can also ask for groups and individuals, and the correct packages are being loaded.

@nevrome
Copy link
Member

nevrome commented Mar 10, 2022

Ok! Cool! This is a frighteningly big PR affecting the heart of trident. Before I start the review I have some general shower thoughts:

  1. We should use this opportunity of a major change to forge to replace the --eigenstrat flag with the more proper, flexible and future-proof --outFormat option used in genoConvert
  2. This PR should be accompanied with a PR to the documentation/website repo to properly explain the new EntitiesList semantics
  3. As this is a severe change that will potentially break the workflows of people using the current selection syntax (e.g. @neija2611) we should add a very obvious warning in case negative selection is used. Something like this:

The semantics of --forgeString and --forgeFile have been changed in trident v0.X.X. Removing samples, groups or packages now follows a different logic. Please see the documentation here (http://...) to verify that your selection still behaves as you expect.

Copy link
Member

@nevrome nevrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it difficult to review that. Hard to grasp the ramification of every change. I like a lot of the changes and left some minor comments on the aspects I don't.

I did not test the new build yet.

@stschiff
Copy link
Member Author

Yes, I apologise for the substantial changes. I think it may be easiest to just go through Forge and Fetch, and then go top-down.

I also think that in cases such as this, our Golden Tests are just awesome. They are the true guarantee that we haven't screwed something up.

@stschiff
Copy link
Member Author

OK, there is another problem I've realised. Right now, entities are read both from the CLI (entitiesDirect) and given file-paths. Implicitly, they are read in order

    entitiesFromFile <- mapM readEntitiesFromFile entitiesFile
    let entitiesInput = nub $ entitiesDirect ++ concat entitiesFromFile

But that of course may have unintended consequences given the new order-dependency.

My radical suggestion: Simplify the CLI-API to either an ordered signed entity list given via the command line, or via a single file. So do not allow multiple files, do not allow both CLI and file input. In my view, there is hardly a use-case for either multiple entity files or both CLI input and files, so I don't think this will harm anyone.

I'm happy to make that change, but wanted to quickly check with you.

@stschiff
Copy link
Member Author

(and I've also now changed the behaviour of forge as we've discussed, if the list starts with an exclude)

@nevrome
Copy link
Member

nevrome commented Mar 13, 2022

My radical suggestion: Simplify the CLI-API to either an ordered signed entity list given via the command line, or via a single file. So do not allow multiple files, do not allow both CLI and file input. In my view, there is hardly a use-case for either multiple entity files or both CLI input and files, so I don't think this will harm anyone.

Ja - ok. I see how this can easily become confusing. Go for it. 👍

@nevrome
Copy link
Member

nevrome commented Mar 13, 2022

Alright - I'm waiting for your OK now to run some practical tests with the new interface.

nevrome added 3 commits March 14, 2022 15:26
…ich then also needed some tweaking to be useful for both forge and genoconvert (with and without default)
@nevrome nevrome mentioned this pull request Mar 14, 2022
@nevrome
Copy link
Member

nevrome commented Mar 15, 2022

Ok - I ran some tests and I think I found somethings that may be a bug - or at least unexpected behaviour. I tested this with a clone of the published_data repository.

This works fine:

$ trident forge -d . --forgeString "*2014_RaghavanNature*" -o huhu
...
Forging with the following entity-list: [*2014_RaghavanNature*]
Searching POSEIDON.yml files... 151 found
Checking Poseidon versions... 
Initializing packages... 
> 151 
Packages loaded: 151
1 packages contain data for this forging operation
...

But excluding the same package has an unforeseen effect:

$ trident forge -d . --forgeString "-*2014_RaghavanNature*" -o huhu
...
Forging with the following entity-list: [-*2014_RaghavanNature*]
Searching POSEIDON.yml files... 151 found
Checking Poseidon versions... 
Initializing packages... 
> 151 
Packages loaded: 151
forge entities begin with exclude or are empty, so implicitly adding all packages as includes before applying excludes.
The following entities do not exist in this dataset and will be ignored: *2014_RaghavanNature*
150 packages contain data for this forging operation
...

So we seem to get the right selection, but with a confusing error message:

The following entities do not exist in this dataset and will be ignored: 2014_RaghavanNature

When I first read this, I was convinced I must have made a typo in the package name.

This seems to happen only if the first entry in the list is an Exclude Pac.

@nevrome
Copy link
Member

nevrome commented Mar 16, 2022

When practically working with this version, I also find the Forging with the following entity-list: ... output pretty annoying. I have 3000 individuals in my forgeFile, so I'm presented with a wall of text. One way to solve this would be to truncate after 3-10 entities.

@stschiff
Copy link
Member Author

All good points. I'm on it.

@stschiff
Copy link
Member Author

I think it should be fixed now.

@nevrome
Copy link
Member

nevrome commented Mar 18, 2022

Safe for a small adjustment of the documentation on the website this good to go, imho.

@stschiff
Copy link
Member Author

OK, I'll merge this then and adjust the documentation

@stschiff stschiff merged commit 4977f62 into master Mar 25, 2022
@stschiff stschiff deleted the pimp_entitieslist branch March 25, 2022 17:06
@nevrome nevrome mentioned this pull request May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants