Add pyani subsample command #135
Labels
- enhancement: something we'd like pyani to do that it doesn't already
- interface: issues related to how the user tells pyani to do something
- method: the issue relates to how results are calculated
We could have a pyani subsample subcommand that can populate a new input directory of genomes. This would be useful when the total number of available genomes for analysis is large.

The kind of structure that the command could take would be along the lines of:
- pyani subsample - basic subcommand
- -n --num_genomes - total number of genomes
- --balance_classes - if not set, the genomes are selected randomly from those in the input directory; if set, then an attempt is made to balance each class.

The way balancing might work is as follows: say there are 200 genomes, and you want to subsample 50. If there are two classes with 100 members each, we'd want to have 25 from each, and a random sampling from each class would be fine. But if there are two classes with 190 and 10 members, we could only balance up to 20 genomes (10 from the group with 10, 10 from the group with 190). So we'd either have to return 50 genomes and warn that the outcome was unbalanced, or we'd only be able to return a balanced set of 10 randomly-selected genomes from each class. So we might want another argument:
- --enforce_balance - which enforces equal numbers from each class. So if there are

This would provide three ways of getting a subsample of size $n$ from the original set:
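One possible reading of those three ways (plain random, best-effort balance with a warning, and enforced balance) is sketched below. This is only a minimal illustration and not an existing pyani API: the function name `subsample`, its parameters, and the `classes` mapping from genome to class label are all hypothetical, and the keyword arguments are assumed to mirror the proposed --balance_classes and --enforce_balance options.

```python
import random
import warnings
from collections import defaultdict


def subsample(genomes, n, classes=None, balance=False, enforce=False, seed=None):
    """Return a subsample of genome names (hypothetical helper, not pyani API).

    genomes -- iterable of genome identifiers
    n       -- requested subsample size (cf. -n / --num_genomes)
    classes -- dict mapping genome identifier -> class label
    balance -- best-effort balancing (cf. --balance_classes)
    enforce -- equal numbers per class (cf. --enforce_balance)
    """
    genomes = list(genomes)
    rng = random.Random(seed)
    if n > len(genomes):
        raise ValueError("cannot sample more genomes than are available")

    # Way 1: no balancing, plain random sample from the input set.
    if not (balance or enforce):
        return rng.sample(genomes, n)

    if classes is None:
        raise ValueError("class labels are required for balancing")

    # Group genomes by class label.
    by_class = defaultdict(list)
    for genome in genomes:
        by_class[classes[genome]].append(genome)
    labels = sorted(by_class)
    per_class = n // len(labels)
    smallest = min(len(members) for members in by_class.values())

    # Way 3: enforce equal numbers, capped by the smallest class.
    # The result may contain fewer than n genomes.
    if enforce:
        k = min(per_class, smallest)
        return [g for label in labels for g in rng.sample(by_class[label], k)]

    # Way 2: take up to per_class from each class, then top up at
    # random so that n genomes are still returned.
    picked = []
    for label in labels:
        members = by_class[label]
        picked.extend(rng.sample(members, min(per_class, len(members))))
    if len(picked) < n:
        chosen = set(picked)
        remaining = [g for g in genomes if g not in chosen]
        picked.extend(rng.sample(remaining, n - len(picked)))
    if smallest < per_class:
        warnings.warn("subsample is unbalanced: a class has too few members")
    return picked
```

With the 190/10 example above and n = 50, this sketch returns 50 genomes (40 and 10) plus a warning when only `balance` is set, and 20 genomes (10 and 10) when `enforce` is set.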