Introducing Python to Health-GPS #370

alexdewar · 2024-05-14T16:58:10Z

alexdewar
May 14, 2024
Maintainer

It seems that the researchers are talking about Pythonising certain parts of the Health-GPS codebase in order to make it easier to develop. While there are definitely some potential pitfalls, it could also be an opportunity, given the better ecosystem and tooling for Python vs. C++. The easiest way to interface between Python and C++ is with the excellent pybind11 library, which supports both calling C++ code from Python and invoking Python code from C++.

An obvious place to start Pythonising the code would be the input layer (see #357). The present implementation doesn't do proper input validation (#360, #361), and fixing it will require rewriting a large part of the existing code. We also want to add some new features to the input layer (e.g. #363, #364), and these will be easier to write in Python. (NB: We're currently only looking at rewriting the code for handling static data rather than the config files for running specific simulations, but the principle is the same.)
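To give a flavour of what input validation could look like on the Python side, here is a minimal sketch using only the standard library. The `CountryRecord` type and its fields are hypothetical stand-ins, not the actual Health-GPS data schema; the point is only that bad input is rejected with a clear error at load time.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CountryRecord:
    """Hypothetical stand-in for one entry of static input data."""

    code: str
    name: str
    population: int


def parse_country(raw: object) -> CountryRecord:
    """Validate one raw JSON value, failing loudly on bad input."""
    if not isinstance(raw, dict):
        raise ValueError(f"expected a JSON object, got {type(raw).__name__}")
    missing = {"code", "name", "population"} - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    population = raw["population"]
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(population, bool) or not isinstance(population, int) or population < 0:
        raise ValueError(f"population must be a non-negative integer, got {population!r}")
    return CountryRecord(code=str(raw["code"]), name=str(raw["name"]), population=population)


def load_countries(text: str) -> List[CountryRecord]:
    """Parse a JSON array of country objects (hypothetical file layout)."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array at the top level")
    return [parse_country(item) for item in data]
```

In practice a schema or validation library might be a better fit than hand-written checks, but that choice is outside what has been discussed here.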

One way to approach this would be to turn the C++ part of Health-GPS into a Python library (using pybind11). For now, we would only need to expose a function (or method) to initiate the simulation. The input data structures, which are currently filled from C++, could for example be wrapped using pybind11 and exposed to Python, which could then fill them using normal Python code. These structures would then be passed to the C++ code via the entry point.

Here's some pseudocode to show what I mean:

from healthgps import InputData, run_simulation
import json
import pandas as pd
import sys

def load_input_data(path: str) -> InputData:
    # ... load config and return an InputData object
    raise NotImplementedError  # placeholder so the function has a body

def main() -> None:
    data = load_input_data(sys.argv[1])  # NB: in reality, we'd do better argument parsing than this
    
    # Launch the simulation!
    run_simulation(data=data)  # Eventually, we'd be passing in more than just one object

if __name__ == "__main__":
    main()

Note that while this would mean that users could start a Health-GPS simulation programmatically, we could still supply a command-line tool that behaves in the same way as the current HealthGPS.Console executable, so nothing would have to change from a user perspective.
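As a rough illustration of how a thin command-line wrapper could sit on top of the library entry point, here is a sketch using `argparse`. The `run_simulation` and `load_input_data` functions below are placeholders standing in for the real bindings, and the `healthgps` program name and `data_path` argument are hypothetical; they are not the actual options of HealthGPS.Console.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def run_simulation(data: dict) -> int:
    """Placeholder for the pybind11-exposed entry point (assumed name)."""
    return 0


def load_input_data(path: Path) -> dict:
    """Placeholder loader; the real one would return an InputData object."""
    return {"source": str(path)}


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser (hypothetical program and argument names)."""
    parser = argparse.ArgumentParser(
        prog="healthgps", description="Run a Health-GPS simulation."
    )
    parser.add_argument(
        "data_path", type=Path, help="directory containing the static input data"
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments (sys.argv[1:] when argv is None) and launch the simulation."""
    args = build_parser().parse_args(argv)
    return run_simulation(data=load_input_data(args.data_path))
```

Calling `main(["path/to/data"])` runs the same code path as the command-line tool would, which also makes the wrapper easy to test.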

@jamesturner246 has talked about making the various components of Health-GPS more modular. Among other things, this would mean that each module would consume only the input data it needs, as opposed to the current scheme, where all the input data is loaded in one place.

This could fit with the scheme I describe above. One possibility is that the Python side of the code could be responsible for instantiating the modules then passing them to the main simulation routine, e.g.:

from healthgps import run_simulation
from healthgps.modules.risk_factor import KevinHall, KevinHallInputData
import json
import pandas as pd

def load_kevin_hall_input_data(path: str) -> KevinHallInputData:
    # ... load data and return it
    raise NotImplementedError  # placeholder so the function has a body

def main() -> None:
    # In reality, this would be done dynamically depending on the config file
    # (some_path is a placeholder for a path taken from that config)
    risk_factor_model = KevinHall(load_kevin_hall_input_data(some_path))

    # ... load other models

    # Launch simulation
    run_simulation(risk_factor_model=risk_factor_model)  # other modules would be passed in here

if __name__ == "__main__":
    main()

Note that we wouldn't actually need to expose the internals of the Kevin Hall module in order to make this work: it could just be instantiated from Python and then passed to C++. If we ever did want to make it possible to write a risk factor model in Python, though, we could write some bindings for the parent class; users would then be able to inherit from it like a normal Python class and pass their subclass in instead.
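To make the "instantiate dynamically depending on the config" and "inherit from the parent class" ideas a bit more concrete, here is a pure-Python sketch. `RiskFactorModel` is a stand-in for a C++ base class exposed through bindings; its name, the `update` method, the `ConstantBmiModel` toy model and the config layout are all hypothetical.

```python
from abc import ABC, abstractmethod


class RiskFactorModel(ABC):
    """Stand-in for a C++ base class exposed through bindings (hypothetical name)."""

    @abstractmethod
    def update(self, person: dict) -> None:
        """Update one simulated person's risk factors in place."""


class ConstantBmiModel(RiskFactorModel):
    """Toy user-written model: sets every person's BMI to a fixed value."""

    def __init__(self, bmi: float) -> None:
        self.bmi = bmi

    def update(self, person: dict) -> None:
        person["bmi"] = self.bmi


# Maps the model name used in a config file to a factory for that model.
MODEL_REGISTRY = {"constant_bmi": ConstantBmiModel}


def create_model(config: dict) -> RiskFactorModel:
    """Instantiate the model named in the config, passing its parameters through."""
    try:
        factory = MODEL_REGISTRY[config["name"]]
    except KeyError:
        raise ValueError(f"unknown risk factor model: {config.get('name')!r}") from None
    return factory(**config.get("params", {}))
```

With real bindings, pybind11 would need a "trampoline" helper class on the C++ side so that virtual calls from the simulation reach the Python override; the sketch above only shows the Python-facing shape.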

Unfortunately, the current architecture of the software wouldn't lend itself terribly well to jamming random snippets of Python code in the middle of it, given that it relies on passing data back and forth between two threads (for the baseline and intervention simulations). Every time Python code is invoked, the GIL is taken and no other thread can run Python code, which in practice would mean that Health-GPS would spend a lot of time waiting for the GIL to be released. The good news is that these two threads are not really necessary. There's no reason we couldn't make Health-GPS run the baseline and intervention simulations one after the other, which would mean no locking and no message passing would be needed -- an improvement in itself! There are more details to think about here, but they probably deserve their own discussion...

jamesturner246 · 2024-05-20T11:26:26Z

jamesturner246
May 20, 2024
Maintainer

Regarding the point:

"There's no reason we couldn't make Health-GPS run the baseline and intervention simulations one after the other,"

That would involve saving, at a minimum, risk factor means and migration numbers for each age and sex group, for each year, and whatever else we want to synchronise in future (possibly per-individual, which would potentially become quite large).

Moreover, running the intervention after the baseline would more or less double the simulation time. Given that the simulations @jzhu20 is running are on the order of 8 hours apiece, I'm not convinced we can afford to serialise this bit. What's more, one of the solutions discussed actually involves increasing this parallelism by simulating one baseline instance and n >= 1 intervention scenarios, so that the baseline runs once rather than n times.
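For illustration, here is a toy sketch of the "one baseline, n interventions" shape, with the baseline's per-year, per-age, per-sex means stored so that each intervention scenario can read them. The numbers, scenario names and function names are all made up; this says nothing about runtime, only about what data would need to be kept around.

```python
from typing import Dict, Tuple

Key = Tuple[int, int, str]  # (year, age, sex)


def run_baseline(years: range) -> Dict[Key, float]:
    """Toy baseline run: record one mean risk factor value per group per year."""
    means: Dict[Key, float] = {}
    for year in years:
        for age in (0, 1):
            for sex in ("F", "M"):
                means[(year, age, sex)] = 20.0 + age  # made-up numbers
    return means


def run_intervention(baseline: Dict[Key, float], effect: float) -> Dict[Key, float]:
    """Toy intervention run: read each stored baseline mean and shift it by `effect`."""
    return {key: mean + effect for key, mean in baseline.items()}


baseline = run_baseline(range(2024, 2026))  # computed once
scenarios = {"tax": -0.5, "labelling": -0.2}  # hypothetical scenarios
results = {
    name: run_intervention(baseline, effect) for name, effect in scenarios.items()
}
```

The stored table has one entry per (year, age, sex) combination, so its size grows with the product of those three dimensions; synchronising per-individual values would be much larger, as noted above.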

I wonder if a solution here would be to compile any potential user-supplied modules, using something like Cython. If the simulation loop is to remain written in C++ then, to keep everything customisable and fast, the language-agnostic modules could be loaded at runtime as shared objects using dlopen().

The alternative is to have the simulation loop itself in Python, given that the loop iteration overhead is insignificant relative to module processing time. One would then have language-specific bindings to load and execute each module, and everything besides the most demanding modules would remain in Python. I'm inclined to think that this would not be much slower than it is now.
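A minimal sketch of what a Python-side simulation loop could look like, assuming each module is just a callable taking the shared state and the current year. The `State` layout and the two toy modules are hypothetical; in the scheme described above, the expensive one would be a compiled extension with the same call signature.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
Module = Callable[[State, int], None]


def ageing(state: State, year: int) -> None:
    """Cheap module kept in pure Python."""
    state["mean_age"] += 1.0


def risk_factors(state: State, year: int) -> None:
    """Stand-in for an expensive module that would really be compiled code."""
    state["mean_bmi"] *= 1.01


def simulate(modules: List[Module], state: State, start: int, end: int) -> State:
    """Run every module once per simulated year, in the order given."""
    for year in range(start, end):
        for module in modules:
            module(state, year)
    return state
```

For example, `simulate([ageing, risk_factors], {"mean_age": 40.0, "mean_bmi": 25.0}, 2024, 2027)` steps through three simulated years. The loop itself does almost no work per iteration, which is the basis for the expectation that keeping it in Python would cost little.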
