forked from mdkarcher/phylodyn
-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathSLT_Simulation.Rmd
68 lines (49 loc) · 2.85 KB
/
SLT_Simulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
---
title: "Single Lineage Tracing Simulation"
author: "Julia Palacios, Mackenzie Simper"
output: rmarkdown::html_vignette
date: "2024-10-07"
vignette: >
%\VignetteIndexEntry{SLT_Simulation}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
Single Lineage Tracing Simulation
===================================
We will first simulate a genealogy with 50 tips under the coalescent model with *exp_tra* effective population size.
```{r}
library("ape")
exp_traj<-function(t){
result=rep(0,length(t))
result[t<=0.1]<-5
result[t>0.1 & t<0.5]<-5*exp(0.5-5*t[t>0.1 & t<0.5]) #50 times smaller
result[t>=0.5]<-0.6766764
return(result)
}
simul1<-coalsim(samp_times = 0, n_sampled = n, traj = exp_traj,method="tt",val_upper=11)
mytree<-generate_newick((simul1))$newick
#plot(mytree,show.tip.label=FALSE)
#axisPhylo()
#summary<-coalescent.intervals(mytree)
#treelength<-sum(summary$lineages*summary$interval.length)
```
We will now simulate a barcode with $S=3$ target sites evolving on the tree just generated according to a CTMC with rate 1 at individual sites and rate of 0.5 of overlapping mutations. All rates are specified in the $Q$ matrix. We further assume there are $M=20$ allele types.
```{r}
#source("new_functions.R")
#source("all_functions.R")
#source("phylotime.R")
library("ape")
library("phylodyn")
library("hash")
set.seed(123)
S = 3 #number of cut sites
theta = matrix(1, nrow = S, ncol = S)
#set different rates for multi-cut
theta[1, 2] <-theta[1, 3] <-theta[2,3]<- 0.5
sdata<-simulate_SLData(numSim=1,mytree,S=3,M=20,theta=theta)
print(sdata)
```
*simulate_SLData* returns a list of length $numSim$ where each entry contains the results from an independent simulation of the mutation process on the tree. Each entry in the list is a list with with two matrices of size $(2n - 1) \times S$. Rows corresponds to node in the tree and columns correspond to sites. Leaves are labeled $1, 2, \dots, n$ and interior nodes are $n + 1, \dots 2n - 1$, with the root labeled $n + 1$.
1. The first matrix stores the mutation states, which are the "lumped" mutations without specific allele labels. For each site $s$, an entry $0$ means there is no mutation, the site is "active"; $1$ means there was a single-cut at site $s$; and $r+1$ means there was a simultaneous cut that started at site $r \le s$.
2. The second matrix stores the specific allele states. Again, $0$ denotes an active site with no mutation; $s:b$ denotes a mutation labeled $b$ from a single-cut at site $s$ and $rs:b$ denotes a mutation labeled $b$ that overlaps from sites $r < s$. For each site, the number $b$ is sampled randomly from $1$ to $M$.
For example, $(0, 2:3, 35:1, 35:1, 35:1)$ means there is no mutation in site $1$, a single mutation in site $2$, and the same mutation in sites $3$ through $5$ from a simultaneous cut. The corresponding mutation state would be $(0, 1, 4, 4, 4)$.