
Tips, tricks, advice, and examples.

Keeping a database of parameter-space explorations

Likelihood surfaces for dynamic models can be very complex, and the computations needed to explore them can be expensive. Keeping a record of all parameter points visited, along with the computed likelihood at each point, is a good way to ensure that you continually improve your picture of the likelihood surface.

Doing this can be as simple as maintaining a CSV file with one column for each parameter, plus columns for the likelihood and its standard error. It can be useful to supplement these with the name of the model and any other qualifying information.
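For instance, a minimal sketch of such a database (the file name, column names, and parameter values here are purely illustrative) might append each new result to a growing CSV file:

## hypothetical result of one likelihood evaluation
result <- data.frame(Beta=3,gamma=2,rho=0.9,N=2600,
                     loglik=-75.3,loglik.se=0.2,model="sir")
csvfile <- "sir_params.csv"
## write a header only when creating the file; append thereafter
write.table(result,file=csvfile,sep=",",row.names=FALSE,
            col.names=!file.exists(csvfile),append=file.exists(csvfile))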

Reproducibility on a multicore machine via bake, stew, and freeze

It is often the case that heavy pomp computations are best performed in parallel on a cluster or multi-core machine. This poses some challenges in trying to ensure reproducibility and avoiding repetition of expensive calculations. The bake, stew, and freeze functions provide some useful facilities in this regard.

For example:

require(pomp)
pompExample(ricker)

require(foreach)
require(doMC)
registerDoMC(5)

bake(file="pfilter1.rds",seed=607686730,kind="L'Ecuyer",{
  foreach (i=1:10, .combine=c, 
           .options.multicore=list(set.seed=TRUE)) %dopar% {
     pf <- pfilter(ricker,Np=1000)
     logLik(pf)
   }
}) -> ll

In the above, bake first checks to see whether the file pfilter1.rds exists. If it does, bake loads it (using readRDS) and stores the result in ll. If it does not, bake evaluates the expression enclosed in braces, stores the result in pfilter1.rds, and returns it. While the expression is evaluated, the R session's pseudorandom number generator (RNG) is temporarily set to the state specified by the seed and kind arguments (see ?set.seed). In this case, the expression to be evaluated uses the foreach and doMC packages to run 10 particle-filtering operations in parallel on a multicore machine. We therefore use a parallel RNG ("L'Ecuyer").
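Since ll is then a vector of 10 replicate log likelihood estimates, one natural follow-up (a sketch using pomp's logmeanexp) is to average them on the likelihood scale and attach a standard error:

## log of the mean of the likelihoods, with a jackknife s.e.
logmeanexp(ll,se=TRUE)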

The bake function stores or retrieves and returns a single R object. If one wants to produce multiple objects in a reproducible way, use stew. For example:

require(pomp)
pompExample(ricker)

require(foreach)
require(doMC)
registerDoMC(5)

stew(file="pfilter2.rda",seed=607686730,kind="L'Ecuyer",{
  te <- system.time(
  foreach (i=1:10, .combine=c, 
           .options.multicore=list(set.seed=TRUE)) %dopar% {
      pf <- pfilter(ricker,Np=1000)
      logLik(pf)
   } -> ll2
  )
})

In the above, stew again temporarily sets the RNG state before evaluating the expression. The objects te and ll2 are created during this evaluation; these are stored in pfilter2.rda to be retrieved if the snippet is run a second time.
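After the snippet runs (or, on subsequent runs, after pfilter2.rda is reloaded), both objects are available in the workspace:

te["elapsed"]            ## wall-clock time of the computation, in seconds
logmeanexp(ll2,se=TRUE)  ## combine the replicate log likelihoods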

Like bake and stew, the freeze function temporarily sets the state of the RNG and evaluates an arbitrary R expression, and finally returns the RNG state to its status quo ante. Unlike bake and stew, freeze neither stores nor retrieves results.
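For example, the following draws reproducible random numbers without disturbing the RNG state of the session (the seed is arbitrary):

freeze(rnorm(5),seed=499586111)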

How can I include a vector of variables in a Csnippet?

See this entry in the FAQ.

How to handle missing data

Missing data are a common complication. pomp can handle NAs in the data, but the measurement-model probability density function (dmeasure), if it is needed, must be written to handle NAs appropriately. For example, look at the following variant of the SIR model describing the influenza outbreak in a boarding school:

library(pomp)
library(magrittr)
library(ggplot2)
library(reshape2)
library(plyr)

read.csv(text="
B,day
NA,0
1,1
6,2
26,3
73,4
NA,5
293,6
258,7
236,8
191,9
124,10
69,11
26,12
11,13
4,14
") -> dat

dat %>%
  na.omit() %>%
  ggplot(aes(x=day, y=B)) +
  geom_line() + geom_point()

Data are missing at days 0 and 5.

We create a pomp object implementing a simple SIR process model and a binomial measurement model, as in the original example. The only difference is in the dmeasure:

sir_step <- Csnippet("
  double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I,1-exp(-gamma*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

sir_init <- Csnippet("
  S = N-1;
  I = 1;
  R = 0;
  H = 0;
")

rmeas <- Csnippet("B = rbinom(H,rho);")

dmeas <- Csnippet("if (ISNA(B)) {
                    lik = (give_log) ? 0 : 1;
                  } else {
                    lik = dbinom(B,H,rho,give_log);
                  }")

dat %>%
  pomp(time="day",t0=0,
       rprocess=euler.sim(sir_step,delta.t=1/12),
       initializer=sir_init,
       rmeasure=rmeas,
       dmeasure=dmeas,
       zeronames="H",
       paramnames=c("Beta","gamma","N", "rho"),
       statenames=c("S","I","R","H")
       ) -> sir_na

Note that the dmeasure returns a likelihood of 1 (log likelihood 0) for the missing data. [What's the probability of seeing something if you don't look?] When there is an observation, it returns a binomial likelihood as usual.

To check for missing data, we use the ISNA macro, which is part of R's C API.

Our simulations now fill in simulated values for the missing data,

sir_na %>%
  simulate(params=c(Beta=3,gamma=2,rho=0.9,N=2600),
           nsim=10,as.data.frame=TRUE,include.data=TRUE) %>%
  ggplot(aes(x=time,y=B,group=sim,color=sim=="data"))+
  geom_line()+
  guides(color=FALSE)+
  theme_bw()

and the particle filter handles the missing data correctly:

sir_na %>%
  pfilter(Np=1000,params=c(Beta=3,gamma=2,rho=0.9,N=2600)) %>%
  as.data.frame() %>%
  melt(id="time") %>%
  ggplot(aes(x=time,y=value,color=variable))+
  guides(color=FALSE)+
  geom_line()+
  facet_wrap(~variable,ncol=1,scales='free_y')+
  theme_bw()

In the above particle-filter computation, notice that the effective sample size (ESS) is at its maximum (equal to the number of particles) at the times of the missing observations, as it should be: the missing data contribute no information, so no particles are downweighted there.
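One can verify this directly by extracting the ESS from the pfilter object; eff.sample.size is the accessor for this in pomp (a sketch):

sir_na %>%
  pfilter(Np=1000,params=c(Beta=3,gamma=2,rho=0.9,N=2600)) %>%
  eff.sample.size() -> ess
ess[c(1,6)]  ## full ESS (=Np) at days 0 and 5, where B is NA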

How to deal with accumulator variables when t0 is much less than t[1]

As described in the documentation (?pomp), pomp allows you to define "accumulator variables" that collect events occurring between observations. The example above illustrates this.

In that example, we took t0 equal to the first observation time. Sometimes we want to take t0 to be much earlier. For example, if we wish to posit that the initial state of the unobserved Markov process is distributed at t0 according to its stationary distribution, one way to achieve this is to set t0 so early that simulations will have equilibrated before the first observation is made.

This is straightforward, but the presence of accumulator variables leads to a small twist. Let's look at the boarding-school flu example to illustrate.

library(pomp)
library(magrittr)
library(ggplot2)
library(reshape2)
library(plyr)

sir_step <- Csnippet("
  double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I,1-exp(-gamma*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
  ")

sir_init <- Csnippet("
  S = N-1;
  I = 1;
  R = 0;
  H = 0;
")

rmeas <- Csnippet("B = rbinom(H,rho);")

dmeas <- Csnippet("if (ISNA(B)) {
                    lik = (give_log) ? 0 : 1;
                  } else {
                    lik = dbinom(B,H,rho,give_log);
                  }")

read.csv(text="
day,B
1,3
2,8
3,28
4,76
5,222
6,293
7,257
8,237
9,192
10,126
11,70
12,28
13,12
14,5
") -> dat

dat %>%
  pomp(time="day",t0=-8,
       rprocess=euler.sim(sir_step,delta.t=1/12),
       initializer=sir_init,
       rmeasure=rmeas,
       dmeasure=dmeas,
       zeronames="H",
       paramnames=c("Beta","gamma","N", "rho"),
       statenames=c("S","I","R","H")
       ) -> bsflu

Note that, as above, we've allowed for the possibility of NAs in the data.

Now let's look at the data and some simulations from the model.

bsflu %>%
  as.data.frame() %>%
  ggplot(aes(x=time,y=B))+
  geom_point()+geom_line()+
  theme_bw()

bsflu %>%
  simulate(params=c(Beta=1.5,gamma=1,rho=0.9,N=2600),
           nsim=10,as.data.frame=TRUE,include.data=TRUE) %>%
  ggplot(aes(x=time,y=B,group=sim,color=sim=="data"))+
  geom_line()+
  guides(color=FALSE)+
  theme_bw()

The data plot looks fine, but what's going on with the simulations? It's easy to understand: we are assuming that there is one infectious host at t = t0 = -8 days. In most of the simulations, this leads to a number of new infections, which the H variable accumulates from t = -8 to t = 1, the time of our first observation. But our first observation is not the number of new cases that occurred over the preceding 9 days: it is the number that occurred (and were reported) over the preceding 1 day.

We can fix this by introducing a fictitious "observation" at t=0, with missing data, i.e., by assuming we observed nothing at t=0:

dat %>%
  rbind(data.frame(day=0,B=NA)) %>%
  arrange(day) %>%
  pomp(time="day",t0=-8,
       rprocess=euler.sim(sir_step,delta.t=1/12),
       initializer=sir_init,
       rmeasure=rmeas,
       dmeasure=dmeas,
       zeronames="H",
       paramnames=c("Beta","gamma","N", "rho"),
       statenames=c("S","I","R","H")
  ) -> bsflu

bsflu %>%
  simulate(params=c(Beta=1.5,gamma=1,rho=0.9,N=2600),
           nsim=10,as.data.frame=TRUE,include.data=TRUE) %>%
  subset(time>=1) %>%
  ggplot(aes(x=time,y=B,group=sim,color=sim=="data"))+
  geom_line()+
  guides(color=FALSE)+
  theme_bw()
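Finally, as a check, the particle filter runs on this modified object without further changes, since the dmeasure already accommodates the fictitious NA observation at t=0:

bsflu %>%
  pfilter(Np=1000,params=c(Beta=1.5,gamma=1,rho=0.9,N=2600)) %>%
  logLik()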