Explore if using cudf.pandas provides any acceleration #1185

Open
matt-graham opened this issue Nov 10, 2023 · 2 comments

@matt-graham
Collaborator

cudf.pandas claims to be a drop-in replacement for pandas with support for GPU acceleration (on NVIDIA GPUs) and support for "100% of the pandas API".

While I suspect we wouldn't get anywhere near the speedups shown in their benchmarks, it may be worth investigating whether cudf.pandas provides any performance advantage for TLOmodel simulations on systems with an NVIDIA GPU. The cudf.pandas module provides an install function that monkey-patches existing pandas imports, and there are also command-line options and an IPython extension that do the same, so in principle this shouldn't require any changes on the TLOmodel side. From a very brief attempt at running this on a Google Colab instance, though, the claim of 100% API compatibility does not appear to be accurate, as we get an error at

[/usr/local/lib/python3.10/dist-packages/tlo/methods/healthburden.py](https://localhost:8080/#) in read_parameters(self, data_folder)
     68         p['DALY_Weight_Database'] = pd.read_csv(Path(self.resourcefilepath) / 'ResourceFile_DALY_Weights.csv')
     69         p['Age_Limit_For_YLL'] = 70.0  # Assumption that only deaths younger than 70y incur years of lost life
---> 70         p['gbd_causes_of_disability'] = set(pd.read_csv(
     71             Path(self.resourcefilepath) / 'gbd' / 'ResourceFile_CausesOfDALYS_GBD2019.csv', header=None)[0].values)
     72

when trying to run a simulation with fullmodel. This appears to be due to accessing the dataframe column using the integer 0 as the label rather than the string "0", even though the former works in standard pandas (though I suspect the latter is probably the recommended approach, as column names are generally strings). If it's just relatively minor differences like this, it would probably not be a massive amount of work to get this working, but it's hard to tell without investigating further.
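
For reference, a minimal sketch of the install-function route mentioned above. The helper name `enable_cudf_pandas` and the fallback behaviour are illustrative assumptions, not part of TLOmodel or cuDF; only `cudf.pandas.install()` is taken from the cuDF documentation. The helper falls back to plain pandas when cuDF cannot be loaded.

```python
import importlib


def enable_cudf_pandas() -> bool:
    """Try to switch on the cudf.pandas accelerator; return True on success.

    Must run before the first `import pandas` in the process so that later
    imports pick up the patched module.
    """
    try:
        accelerator = importlib.import_module("cudf.pandas")
    except Exception:
        # cudf is not installed, or no usable NVIDIA GPU/driver was found.
        return False
    accelerator.install()
    return True


gpu_enabled = enable_cudf_pandas()
print("cudf.pandas active:", gpu_enabled)
```

The routes that need no code changes are `python -m cudf.pandas script.py` on the command line and `%load_ext cudf.pandas` in IPython or Jupyter.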

@matt-graham matt-graham added question Further information is requested performance labels Nov 10, 2023
@matt-graham matt-graham self-assigned this Nov 13, 2023
@beckernick

Hi @matt-graham ! I came across this issue due to the cudf.pandas reference (I work on this and other RAPIDS projects). Glad to see you're interested in cudf.pandas.

It looks like this error is coming from this cuDF issue. It's definitely a bug. We'll explore what solving it might look like.

In the meantime, a potential workaround might be to temporarily switch this line to instead grab the first column with something like .iloc[:, 0] that doesn't rely on the column name (since the file has no header anyway). Would love to see if cudf.pandas can provide a speedup here!
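
A small sketch of that workaround in plain pandas, assuming a made-up headerless CSV (the real resource file's contents are not shown in this thread). With `header=None`, pandas labels the columns with the integers 0, 1, ..., so selecting by label `0` and selecting by position with `.iloc[:, 0]` should pick out the same column.

```python
import io

import pandas as pd

# Stand-in for a headerless resource file; these rows are invented purely
# for illustration.
csv_text = "Cause A\nCause B\nCause A\n"

df = pd.read_csv(io.StringIO(csv_text), header=None)

# Label-based selection: relies on the column being labelled with integer 0.
by_label = set(df[0].values)

# Positional selection: first column, regardless of how it is labelled.
by_position = set(df.iloc[:, 0].values)

print(by_label == by_position)
```

Because the positional form never looks at the column label, it avoids the integer-versus-string question entirely.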

@matt-graham
Collaborator Author

Hi @beckernick! Thanks for the pointer to the issue and for the suggested workaround. I'll have a look at implementing this and seeing if we hit any other problems.

I just noticed that the cudf.pandas docs indicate that compatibility with pandas 1.5.x is currently being targeted, and there is an issue tracking pandas 2.0 support at rapidsai/cudf#12794. As we require pandas 2.0 or above here, we may also need to wait for that to be resolved.
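
As a rough way to check which side of that version boundary an environment sits on, something like the following could be run first. The helper names `installed_version` and `major_version` are hypothetical; the sketch uses only the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Extract the leading integer from a version string such as '2.1.3'."""
    return int(version.split(".")[0])


pandas_version = installed_version("pandas")
if pandas_version is None:
    print("pandas is not installed")
elif major_version(pandas_version) >= 2:
    print(f"pandas {pandas_version}: meets the 2.0+ requirement mentioned above")
else:
    print(f"pandas {pandas_version}: older than 2.0")
```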
