Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Sampling without replacement option for simgenotype #194

Merged
merged 7 commits into from
Feb 23, 2023

Conversation

mlamkin7
Copy link
Collaborator

@mlamkin7 mlamkin7 commented Feb 17, 2023

Added sampling without replacement flag in simgenotype --no_replacement that allows the user to run simgenotype and when outputting the newly simulated variants, the reference vcf used will be sampled without replacement per region.

Code adapted from Amy Williams PR that originally used pandas but now uses numpy and lists

@mlamkin7 mlamkin7 requested a review from aryarm February 17, 2023 20:20
Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! I think the tests are pretty comprehensive
and the code seems pretty straightforward
great job!

@@ -113,6 +122,9 @@ def output_vcf(
for ind, sample in enumerate(vcf.samples):
sample_dict[sample] = ind

# initialize hap_used array which should contain list of lists for each reference sample's genome and what segment has been used for each
haps_used = [[] for samp in range(len(vcf.samples)*2)]
Copy link
Member

@aryarm aryarm Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this scale well for large numbers of samples? I recall that python dynamically allocates memory for list items, even within list comprehensions
so it might have to repeatedly make copes whilst constructing this list?

Honestly, it might not even matter in terms of time, compared to other things in our code. I just wanted to confirm

Copy link

@amythewilliams amythewilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adapting all this to not rely on Pandas! Overall this seems great. The primary concern is with regard to how random the haplotypes will be. Also, in admix-simu, we ensured that the haplotype changes at each segment breakpoint. It seems like this code may not do that? Look forward to this being added to haptools!

Update haplotype choices for replacement so it doesn't switch when it isn't changing segments
Ensured that every segment we grab is randomly selecting a sample and not just the first sample in the list over and over again
@mlamkin7 mlamkin7 merged commit 85bd494 into main Feb 23, 2023
@aryarm aryarm deleted the feat/outvcf_no_replacement branch February 24, 2023 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants