-
Notifications
You must be signed in to change notification settings - Fork 27
pimp my pomp
- Keeping a database of parameter-space explorations
- Reproducibility on a multicore machine via
bake
,stew
, andfreeze
- How can I include a vector of variables in a Csnippet?
- [How to handle missing data](#How to handle missing data)
- [How to deal with accumulator variables and t0 much less than t[1]](# How to deal with accumulator variables and t0 much less than t[1])
Likelihood surfaces for dynamic models can be very complex and the computations needed to explore them can be expensive. By keeping a record of all parameter points visited, along with the computed likelihood at each point, is a good way to ensure that you continually improve your picture of the likelihood surface.
Doing this can be as simple as maintaining a CSV file with one column for each parameter, plus the likelihood (and s.e.). It can be useful to supplement this with an indication of the name of the model and any other qualifying information.
It is often the case that heavy pomp computations are best performed in parallel on a cluster or multi-core machine.
This poses some challenges in trying to ensure reproducibility and avoiding repetition of expensive calculations.
The bake
, stew
, and freeze
functions provide some useful facilities in this regard.
For example:
require(pomp)
pompExample(ricker)
require(foreach)
require(doMC)
registerDoMC(5)
bake(file="pfilter1.rds",seed=607686730,kind="L'Ecuyer",{
foreach (i=1:10, .combine=c,
.options.multicore=list(set.seed=TRUE)) %dopar% {
pf <- pfilter(ricker,Np=1000)
logLik(pf)
}
}) -> ll
In the above bake
first checks to see whether the file pfilter1.rds
exists.
If it does, it then loads it (using readRDS
) and stores the result in ll
.
If it does not, it evaluates the expression embraced in the brackets, stores the result in pfilter1.rds
, and returns it.
While the expression is evaluated, the R session's pseudorandom number generator (RNG) is temporarily set to the state specified by the seed
and kind
arguments (see ?set.seed
).
In this case, the expression to be evaluated makes use of the foreach
and doMC
packages to run 10 particle filtering operations in parallel on a multicore machine.
We therefore use a parallel RNG ("L'Ecuyer"
).
The bake
function stores or retrieves and returns a single R object.
If one wants to produce multiple objects in a reproducible way, use stew
.
For example:
require(pomp)
pompExample(ricker)
stew(file="pfilter2.rda",seed=607686730,kind="L'Ecuyer",{
te <- system.time(
foreach (i=1:10, .combine=c,
.options.multicore=list(set.seed=TRUE)) %dopar% {
pf <- pfilter(ricker,Np=1000)
logLik(pf)
} -> ll2
)
})
In the above, stew
again temporarily sets the RNG state before evaluating the expression.
The objects te
and ll2
are created during this evaluation;
these are stored in pfilter2.rda
to be retrieved if the snippet is run a second time.
Like bake
and stew
, the freeze
function temporarily sets the state of the RNG and evaluates an arbitrary R expression, and finally returns the RNG state to its status quo ante.
Unlike bake
and stew
, freeze
neither stores nor retrieves results.
##How to handle missing data Time series data are typically collected over equidistant time points, though often with missing data within the observation period. These should be flagged as NA in the time series in order to be properly interpreted by pomp. You may for example look at a modified variant of the SIR model describing the influenza outbreak in a boarding school:
read.csv(text="
B,day
NA, 0
1, 1
6, 2
26, 3
73, 4
NA, 5
293, 6
258, 7
236, 8
191, 9
124,10
69, 11
26, 12
11, 13
4, 14
")->bsflu_na
bsflu_melt <-melt(bsflu_na, id.var='day')
bsflu_melt <-na.omit(bsflu_melt)
ggplot(data=bsflu_melt, aes(x=day, y=value, color=variable)) +
geom_line() + geom_point()
The data set shows a missing value at day 0 (which had not been flagged in the original data set) and at day 5 for which a pomp object is created that represents the same SIR model and measurement models as in the original example
sir_step <- Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_IR = rbinom(I,1-exp(-gamma*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
//Output for debugging
Rprintf(\" %f \\t %f \\n \",I,H);
")
sir_init <- Csnippet("
S = N-1;
I = 1;
R = 0;
H = 0;
")
dmeas <- Csnippet("lik = dbinom(B,H,rho,give_log);")
rmeas <- Csnippet("B = rbinom(H,rho);")
pomp(bsflu_na,time="day",t0=-8,rprocess=euler.sim(sir_step,delta.t=1/12),
initializer=sir_init, rmeasure=rmeas,dmeasure=dmeas,
zeronames="H",paramnames=c("Beta","gamma","N", "rho"),
statenames=c("S","I","R","H")) -> sir_na
sims_na <- simulate(sir_na,params=c(Beta=1.5,gamma=1,rho=0.9,N=2600),
nsim=15,as=TRUE,include=TRUE)
Note that the simulation starts at t0=-8
(as defined in the pomp object), therefore the accumulator variable H collects events over 8 days instead of 1 day for day 0 which can be seen when plotting the simulation results
sims_na %>%
melt(id=c("sim","time")) %>%
ggplot(aes(x=time,group=sim,color=sim,y=value))+
geom_line()+
facet_wrap(~variable,scales="free_y")
While the range of simulations can be set via
sims_na <- simulate(sir_na,params=c(Beta=1.5,gamma=1,rho=0.9,N=2600),
times=c(0:14), nsim=15,as=TRUE,include=TRUE)
see also [how to deal with accumulator variables and t0 much less than t[1].](# How to deal with accumulator variables and t0 much less than t[1])
As shown [above](#How to handle missing data), the accumulator variable collects events between any two observations in the time series provided, i.e. were the missing data at day 5 not flagged as NA but represented by a missing row the the accumulator variable would have collected events over 2 days which usually will not represent what is listed in time series data. A similar issue arises if the simulation needs to start at t0 much less than t[1] which can be circumvented by introducing a "missing value" at t[0] as shown above. The accumulator variable shows a proper value for the first data point in the time series at t[1] which can be seen when plotting the subset starting at t[1].
sims_na %>%
subset(time>"0")%>%
melt(id=c("sim","time")) %>%
ggplot(aes(x=time,group=sim,color=sim,y=value))+
geom_line()+
facet_wrap(~variable,scales="free_y")