Replies: 1 comment
Regarding the point:
That would involve saving, at a minimum, risk factor means and migration numbers for each age and sex group, for each year, plus whatever else we want to synchronise in future (possibly per-individual data, which could become quite large). Moreover, running the intervention after the baseline would more or less double the simulation time. Given that the simulations @jzhu20 is running take on the order of 8 hours apiece, I'm not convinced we can afford to serialise this bit. Moreover still, one of the solutions discussed actually involves increasing this parallelism, simulating one baseline instance and [...]

I wonder if a solution here would be to compile any potential user-supplied modules, using something like Cython. If the simulation loop is to remain written in C++, then, to keep everything customisable and fast, the language-agnostic modules can be loaded at runtime as shared objects, using [...]

The alternative is to have the simulation loop itself in Python, given that iteration time is insignificant relative to module processing time. One would then have language-specific bindings which would load and execute each module, and everything besides the most demanding modules would remain in Python. I'm inclined to think that this would not be much slower than it is now.
It seems that the researchers are talking about Pythonising certain parts of the Health-GPS codebase in order to make it easier to develop. While there are definitely some potential pitfalls, it could also be an opportunity, given the better ecosystem and tooling for Python vs. C++. The easiest way to interface between Python and C++ is with the excellent `pybind11` library, which allows invoking Python code from C++ and vice versa.

An obvious place to start Pythonising the code would be the input layer (see #357). The present implementation doesn't do proper input validation (#360, #361) and fixing it will require rewriting a large part of the existing code. Moreover, we also want to add some new features to the input layer (e.g. #363, #364) and these will be easier to write in Python. (NB: We're currently only looking at rewriting the code for handling static data rather than the config files for running specific simulations, but the principle is the same.)
One way to approach this would be to turn the C++ part of Health-GPS into a Python library (using `pybind11`). For now, we would only need to expose a function (or method) to initiate the simulation. Input data structures which are currently filled from C++ could be wrapped using `pybind11` and exposed to Python code (for example), which could then fill them using normal Python code. These structures would then be passed to the C++ code via the entry point.

Here's some pseudocode to show what I mean:
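A minimal sketch of the idea, written in plain Python so that it runs on its own. `RiskFactorTable`, `SimulationInputs` and `run_simulation` are hypothetical stand-ins for what a `pybind11` extension module might expose; none of these names exist in Health-GPS today, and the input values are made up.

```python
from dataclasses import dataclass, field


# Stand-ins for types a pybind11 module might expose. In the scheme described
# above these would be C++ structures; all names here are hypothetical.
@dataclass
class RiskFactorTable:
    """Input structure that is currently filled from C++."""

    means: dict[tuple[int, str], float] = field(default_factory=dict)

    def add(self, age: int, sex: str, mean: float) -> None:
        # Input validation lives on the Python side.
        if age < 0:
            raise ValueError(f"age must be non-negative, got {age}")
        if sex not in ("male", "female"):
            raise ValueError(f"unknown sex: {sex!r}")
        self.means[(age, sex)] = mean


@dataclass
class SimulationInputs:
    start_year: int
    end_year: int
    risk_factors: RiskFactorTable


def run_simulation(inputs: SimulationInputs) -> int:
    """Placeholder for the single exposed entry point (C++ in practice)."""
    if inputs.end_year < inputs.start_year:
        raise ValueError("end_year precedes start_year")
    return inputs.end_year - inputs.start_year + 1  # number of years simulated


# Python side: fill and validate the inputs, then hand them to the entry point.
rows = [(30, "male", 24.1), (30, "female", 23.4)]  # made-up illustrative values
table = RiskFactorTable()
for age, sex, mean in rows:
    table.add(age, sex, mean)

inputs = SimulationInputs(start_year=2020, end_year=2030, risk_factors=table)
years_run = run_simulation(inputs)
```

The point of the split is that loading and validation are ordinary Python, while the simulation itself stays behind one call.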
Note that while this would mean that users could start a Health-GPS simulation programmatically, we could still supply a command-line tool that behaves in the same way as the current `HealthGPS.Console` executable, so nothing would have to change from a user perspective.

@jamesturner246 has talked about making the various components of Health-GPS more modular. Among other things, this would mean that different modules would consume only the input data they need, as opposed to the current scheme, where all the input data is loaded in one place.
This could fit with the scheme I describe above. One possibility is that the Python side of the code could be responsible for instantiating the modules then passing them to the main simulation routine, e.g.:
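As a rough illustration, assuming each module takes only its own input data and offers an `update` method (both assumptions, as are all the class and function names below), the Python side might look something like this. The module classes are stand-ins for compiled C++ types, and their bodies are placeholders.

```python
class DemographicModule:
    """Stand-in for a compiled module; Python only constructs it."""

    def __init__(self, population_data: dict[str, float]) -> None:
        self.population_data = population_data  # only the data this module needs

    def update(self, year: int, log: list[str]) -> None:
        log.append(f"{year}: demographic")  # placeholder for the real work


class KevinHallModule:
    """Stand-in for the Kevin Hall module; its internals stay on the C++ side."""

    def __init__(self, model_data: dict[str, float]) -> None:
        self.model_data = model_data

    def update(self, year: int, log: list[str]) -> None:
        log.append(f"{year}: kevin_hall")  # placeholder for the real work


def run_simulation(modules: list, start_year: int, end_year: int) -> list[str]:
    """Placeholder for the main simulation routine (C++ in practice)."""
    log: list[str] = []
    for year in range(start_year, end_year + 1):
        for module in modules:
            module.update(year, log)
    return log


# Python instantiates the modules, each with its own inputs, and passes them in.
modules = [
    DemographicModule(population_data={}),
    KevinHallModule(model_data={}),
]
log = run_simulation(modules, start_year=2020, end_year=2021)
```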
Note that we wouldn't actually need to expose the internals of the Kevin Hall module in order to make this work. It could just be instantiated from Python then passed to C++. But note that if we ever did want to make it possible to write a risk factor model in Python, we could just write some bindings for the parent class and then users would be able to inherit from it like a normal Python class and pass this in instead.
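A sketch of what that could look like from the user's side, with an abstract Python class standing in for the bound C++ parent class. `RiskFactorModel`, its `update` method and `LinearBmiTrend` are hypothetical names, and the model is purely illustrative. On the C++ side, `pybind11` needs a "trampoline" class for the parent so that methods overridden in Python are actually called from C++.

```python
from abc import ABC, abstractmethod


class RiskFactorModel(ABC):
    """Stand-in for a bound C++ parent class (hypothetical interface)."""

    @abstractmethod
    def update(self, person: dict[str, float]) -> None:
        """Advance one person's risk factors by one year."""


class LinearBmiTrend(RiskFactorModel):
    """A user-written model: illustrative only, not a real Health-GPS model."""

    def __init__(self, yearly_change: float) -> None:
        self.yearly_change = yearly_change

    def update(self, person: dict[str, float]) -> None:
        person["bmi"] += self.yearly_change


def simulate(model: RiskFactorModel, population: list[dict[str, float]], years: int) -> None:
    """Placeholder for the simulation loop, which only sees the parent interface."""
    for _ in range(years):
        for person in population:
            model.update(person)


population = [{"bmi": 25.0}, {"bmi": 30.0}]
simulate(LinearBmiTrend(yearly_change=0.5), population, years=4)
```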
Unfortunately, the current architecture of the software wouldn't lend itself terribly well to jamming random snippets of Python code in the middle of it, given that it relies on passing data back and forth between two threads (for the baseline and intervention simulations). Every time Python code is invoked, the GIL (Python's global interpreter lock) must be held, so no other thread can run Python code at the same time. In practice this would mean that Health-GPS would spend a lot of time waiting for the GIL to be released.

The good news is that these two threads are not really necessary. There's no reason we couldn't make Health-GPS run the baseline and intervention simulations one after the other, which would mean no locking and no message passing would be needed -- an improvement by itself! There are more details to think about here, but they probably deserve their own discussion...
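To make the sequential idea concrete, here is a minimal sketch in which the baseline run returns a summary keyed by year, age and sex, and the intervention run then reads it. The function name, the age groups and all numbers are placeholders of mine rather than anything from the codebase.

```python
from typing import Optional

Key = tuple[int, int, str]  # (year, age, sex)
GROUPS = [(age, sex) for age in (30, 40) for sex in ("female", "male")]


def run_scenario(years: range, baseline: Optional[dict[Key, float]] = None) -> dict[Key, float]:
    """Run one scenario and return a risk factor mean per (year, age, sex).

    With baseline=None this is the baseline run. Otherwise it is the
    intervention run, which looks up the stored baseline means instead of
    receiving them as messages from another thread.
    """
    means: dict[Key, float] = {}
    for year in years:
        for age, sex in GROUPS:
            if baseline is None:
                means[(year, age, sex)] = 25.0  # placeholder for real model output
            else:
                # Placeholder intervention effect relative to the baseline.
                means[(year, age, sex)] = baseline[(year, age, sex)] - 1.0
    return means


years = range(2020, 2023)
baseline_means = run_scenario(years)  # first run, no synchronisation needed
intervention_means = run_scenario(years, baseline=baseline_means)  # second run
```

With this layout, the stored summary has one entry per year, age and sex group for each quantity that needs to be shared between the two runs.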