[FEA] Consolidate remaining CUDA runtime calls in bench/ann
#1318
Comments
Note, #1661 goes the other way around: it minimizes the use of the utilities in RAFT.
@tfeher and @achirkin - I'm still not convinced that #1661 is the direction we want to be taking the benchmarks. First, we have recently removed […]. That being said, I'm not totally against the use of gbench in the benchmarks, but for now we are proceeding with the cpu-only build by isolating the builds of the existing executables.
Hi @cjnolet, I would also hesitate to use dlopen in a library component, but in the PR it is only ever used in the ANN_BENCH executable itself. With the latest fixes to the python scripts the PR is ready now. Of course, in the end the decision is up to you. Just let me know if you prefer to keep the current benchmarks as-is - I'll just pivot the gbench branch into a separate repo for internal use. But I cannot keep it up-to-date with the ongoing changes indefinitely.
Unfortunately, we are now pointing end-users towards these benchmarks and they are quickly becoming a vital part of the RAFT toolset. I think we need to move past the perspective that these are just for internal use and that we don't need to be concerned with the user experience here. dlopen has bitten us too many times in the past and I'd like to avoid the issues we faced trying to support it.
I think what you might be saying here is that having a single executable for the RAFT algorithm benchmarks is a must-have, right? The RAFT algorithm benchmarks don't need a cpu-only option, so I'm not sure why all of the benchmarks need to be consolidated into a single binary when only HNSW (and potentially some of the FAISS algorithms) needs to be cpu-only. Separate binaries for RAFT, FAISS CPU, FAISS GPU, HNSW, and GGNN should be fine, no? Then we only need to link CUDA for the binaries that require the GPU, and we can more easily separate the other binaries into cpu-only packaging.
Even if this is the direction you decide to take these changes, we still need to make sure that any benchmarks we publish publicly for users are performed with the tools which are publicly available in RAFT, so that we aren't sharing results they cannot directly reproduce. My suggestion would be that we try to make this work, and if my suggestion above is enough for that, then I think we can consider it an alternative to pulling this out.
The way it works now in the PR is that we have a single executable, ANN_BENCH, which loads the individual algorithm benchmarks as shared libraries. I need to be able to run all benchmarks via a single executable and produce a single .csv/.json report, because I often cannot rely on multiple layers of Python/Perl wrappers: they require setting up a conda environment and lack most of the needed functionality. If having separate executables is the only issue you have with the PR, I can change the cmake config to produce both: shared libraries + ANN_BENCH, and the executables for every algorithm (which don't use dlopen).
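For context, here is a minimal sketch of how such a dlopen-based plugin mechanism typically works; the `AnnAlgo` interface, the `create_algo` symbol, and the library paths are illustrative assumptions, not the actual names used in the PR.

```cpp
// Minimal sketch of a dlopen-based plugin loader, assuming each algorithm is
// built as a shared library that exposes a C factory symbol. All names here
// (AnnAlgo, create_algo) are illustrative, not the PR's actual interface.
#include <dlfcn.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical interface each plugin would implement.
struct AnnAlgo {
  virtual void build(const float* data, std::size_t n_rows, std::size_t dim) = 0;
  virtual ~AnnAlgo() = default;
};

using create_algo_t = AnnAlgo* (*)();

AnnAlgo* load_algo(const std::string& lib_path)
{
  // Load the per-algorithm shared library at runtime.
  void* handle = dlopen(lib_path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) { throw std::runtime_error(dlerror()); }

  // Resolve the factory symbol and construct the algorithm instance.
  auto create = reinterpret_cast<create_algo_t>(dlsym(handle, "create_algo"));
  if (create == nullptr) { throw std::runtime_error(dlerror()); }
  return create();
}
```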
dlopen is the main deal breaker there, so we need to fix that part, and not by overcomplicating the cmake to fork two different build paths. I'm still not sure I understand the motivation for only having a single executable: nsys and csv output will work just the same whether we provide a bash or python script running multiple executables, or a single executable. I'm also fine producing two individual executables (one which is cpu-only and one with gpu+cpu). Neither of the options I've mentioned above requires dlopen, and both still achieve what you are looking for.
It takes time and effort to combine multiple csv files into one, because the columns are different; the preamble/context is also important, as it contains the information about the dataset and GPU. @cjnolet please have a look at the latest update. I've set up the cmake to produce one executable per benchmark, same as before, by default; these don't link against or use dlopen. I hope we can still retain the single ANN_BENCH executable as well.
Thanks @achirkin, but again […]
The main reason I have the single ANN_BENCH executable is that I need to be able to plug-and-play multiple benchmark implementations and then ideally run them from a single executable. A typical scenario: I build […]. At this point, dlopen is only ever used in the ANN_BENCH executable itself.
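As a rough illustration of that plug-and-play scenario (building on the loader sketched above, with made-up library names), the single executable would simply iterate over whichever algorithm libraries happen to have been built:

```cpp
// Hypothetical driver: run every algorithm plugin that was built, from one
// executable, appending results to a single report. Library names are made up,
// and AnnAlgo/load_algo come from the loader sketch above.
#include <memory>
#include <string>
#include <vector>

int main()
{
  std::vector<std::string> libs = {"./libraft_ivf_pq_plugin.so", "./libhnswlib_plugin.so"};
  for (auto const& lib : libs) {
    std::unique_ptr<AnnAlgo> algo{load_algo(lib)};
    // ... run the configured benchmark cases and append to one CSV/JSON report ...
  }
  return 0;
}
```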
While working on #1304, I found that there are still several places in `cpp/bench/ann/common` where CUDA runtime calls are made directly, even though there are utilities in RAFT that could be used instead. There are also some calls that allocate and free memory directly, rather than using RMM. I didn't want to change these initially for fear that it might introduce unexpected bugs, but this should be revisited and the CUDA runtime calls consolidated to use RAFT APIs wherever possible.
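To illustrate the kind of change this asks for (a sketch only, not the actual bench/ann code; it assumes a RAFT version where `raft::device_resources`, `raft::copy`, and `rmm::device_uvector` are available), a direct `cudaMalloc`/`cudaMemcpy`/`cudaFree` sequence would be replaced roughly like this:

```cpp
// Illustrative only: replacing direct CUDA runtime allocation/copy with the
// RAFT/RMM equivalents. The real bench/ann code differs in its details.
#include <raft/core/device_resources.hpp>
#include <raft/util/cudart_utils.hpp>  // raft::copy
#include <rmm/device_uvector.hpp>

#include <vector>

void upload_example(raft::device_resources const& res, std::vector<float> const& host_data)
{
  auto stream = res.get_stream();

  // Before: cudaMalloc + cudaMemcpy + cudaFree called directly.
  // After: RMM owns the device allocation and RAFT wraps the copy.
  rmm::device_uvector<float> device_data(host_data.size(), stream);
  raft::copy(device_data.data(), host_data.data(), host_data.size(), stream);

  // Equivalent to a cudaStreamSynchronize on the handle's stream.
  res.sync_stream();
}
```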