-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Sampling without replacement option for simgenotype #194
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great! I think the tests are pretty comprehensive
and the code seems pretty straightforward
great job!
@@ -113,6 +122,9 @@ def output_vcf( | |||
for ind, sample in enumerate(vcf.samples): | |||
sample_dict[sample] = ind | |||
|
|||
# initialize hap_used array which should contain list of lists for each reference sample's genome and what segment has been used for each | |||
haps_used = [[] for samp in range(len(vcf.samples)*2)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this scale well for large numbers of samples? I recall that python dynamically allocates memory for list items, even within list comprehensions
so it might have to repeatedly make copes whilst constructing this list?
Honestly, it might not even matter in terms of time, compared to other things in our code. I just wanted to confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adapting all this to not rely on Pandas! Overall this seems great. The primary concern is with regard to how random the haplotypes will be. Also, in admix-simu, we ensured that the haplotype changes at each segment breakpoint. It seems like this code may not do that? Look forward to this being added to haptools!
Update haplotype choices for replacement so it doesn't switch when it isn't changing segments
Ensured that every segment we grab is randomly selecting a sample and not just the first sample in the list over and over again
Added sampling without replacement flag in simgenotype
--no_replacement
that allows the user to run simgenotype and when outputting the newly simulated variants, the reference vcf used will be sampled without replacement per region.Code adapted from Amy Williams PR that originally used pandas but now uses numpy and lists