cudf.pandas claims to be a drop-in replacement for pandas with support for GPU acceleration (on NVIDIA GPUs) and support for "100% of the pandas API".
While I suspect we wouldn't get anywhere near the speedups shown in their benchmarks, it may be worth investigating whether using cudf.pandas provides any performance advantage for TLOmodel simulations on systems with an NVIDIA GPU. The cudf.pandas module provides an install function for monkey-patching existing pandas imports, and there are also command-line options and an IPython extension for doing the same, so technically this shouldn't require any changes on the TLOmodel side. From a very brief attempt at running this on a Google Colab instance, it seems the claim of 100% API compatibility is not accurate, as we get an error at
/usr/local/lib/python3.10/dist-packages/tlo/methods/healthburden.py in read_parameters(self, data_folder)
68 p['DALY_Weight_Database'] = pd.read_csv(Path(self.resourcefilepath) / 'ResourceFile_DALY_Weights.csv')
69 p['Age_Limit_For_YLL'] = 70.0 # Assumption that only deaths younger than 70y incur years of lost life
---> 70 p['gbd_causes_of_disability'] = set(pd.read_csv(
71 Path(self.resourcefilepath) / 'gbd' / 'ResourceFile_CausesOfDALYS_GBD2019.csv', header=None)[0].values)
72
when trying to run a simulation with fullmodel, which appears to be due to accessing the dataframe column using the integer index 0 rather than the string "0", despite the former working in standard pandas (though I suspect the latter is probably the recommended approach, as column names are generally strings). If it's just relatively minor differences like this, it would probably not be a massive amount of work to get this working, but it's hard to tell without investigating further.
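For reference, here is a minimal sketch of the access pattern in question under standard pandas. The CSV contents are made up for illustration and are not taken from ResourceFile_CausesOfDALYS_GBD2019.csv. With header=None, pandas assigns integer column labels, so the integer lookup succeeds while the string lookup does not.

```python
import io

import pandas as pd

# Hypothetical header-less CSV standing in for the real resource file.
csv_text = "Cause A\nCause B\nCause A\n"

# header=None makes pandas label the columns 0, 1, 2, ... as integers.
df = pd.read_csv(io.StringIO(csv_text), header=None)

label = df.columns[0]        # integer label, not the string "0"
causes = set(df[0].values)   # same pattern as in healthburden.py

# Check whether the string label "0" resolves to a column in standard pandas.
try:
    df["0"]
    string_label_works = True
except KeyError:
    string_label_works = False

print(label, causes, string_label_works)
```

To repeat the same check under cudf.pandas, the script could be launched with `python -m cudf.pandas script.py`, or `%load_ext cudf.pandas` could be run in a notebook before importing pandas, assuming cuDF is installed on a machine with an NVIDIA GPU.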
Hi @matt-graham ! I came across this issue due to the cudf.pandas reference (I work on this and other RAPIDS projects). Glad to see you're interested in cudf.pandas.
It looks like this error is coming from this cuDF issue. It's definitely a bug. We'll explore what solving it might look like.
In the meantime, a potential workaround might be to temporarily switch this line to instead grab the first column with something like .iloc[:, 0] that doesn't rely on the column name (since the file has no header anyway). Would love to see if cudf.pandas can provide a speedup here!
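A rough sketch of what that workaround could look like, written as a standalone function rather than a patch to healthburden.py. The helper name first_column_values and the sample data are hypothetical; only the .iloc[:, 0] access comes from the suggestion above.

```python
import io

import pandas as pd


def first_column_values(csv_source):
    """Return the unique values in the first column of a header-less CSV.

    Hypothetical helper: selects the column by position with .iloc, so it
    does not depend on whether the column label is the integer 0 or "0".
    """
    df = pd.read_csv(csv_source, header=None)
    return set(df.iloc[:, 0].values)


# Made-up data for illustration only.
sample = io.StringIO("Cause A\nCause B\nCause A\n")
unique_causes = first_column_values(sample)
print(unique_causes)
```

Positional selection gives the same result as df[0] in standard pandas, so a change like this should be safe to keep regardless of whether cudf.pandas is in use.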
Hi @beckernick! Thanks for the pointer to the issue and for the suggested workaround, will have a look at implementing this and seeing if we hit against any other problems.
I just noticed that the cudf.pandas docs indicate that compatibility with pandas 1.5.x is currently being targeted, and there is an issue tracking pandas 2.0 support at rapidsai/cudf#12794. As we require pandas 2.0 or above here, we may also need to wait for that to be resolved.