Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sampling Performance Testing #3584

Merged
merged 490 commits into from
Jan 12, 2024

Conversation

alexbarghi-nv
Copy link
Member

@alexbarghi-nv alexbarghi-nv commented May 19, 2023

Adds performance benchmarking scripts for testing MNMG cuGraph GNN workflows.
This branch is the head branch for the cuGraph benchmarking effort. All work supporting the benchmarks should be merged into this branch. It will be merged into branch-24.02 once all features are ready.

Includes patches to cuGraph-PyG required for the latest DLFW container.

To-Do:

  • Refactor for branch-24.02
  • Add WholeGraph training portion Deferred to future PR (see Add WholeGraph Support alexbarghi-nv/cugraph#6)
  • Add WholeGraph generators Included in above
  • Support DGL Deferred to future PR
  • Use appropriate docker containers Deferred, waiting on DLFW release

Closes #3839

@alexbarghi-nv alexbarghi-nv added feature request New feature or request Blocked Cannot progress due to external reasons non-breaking Non-breaking change labels May 19, 2023
@alexbarghi-nv alexbarghi-nv self-assigned this May 19, 2023
@alexbarghi-nv alexbarghi-nv added this to the 23.08 milestone May 19, 2023
@BradReesWork BradReesWork changed the base branch from branch-23.06 to branch-23.08 May 30, 2023 12:55
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@BradReesWork BradReesWork modified the milestones: 23.08, 23.10 Jul 25, 2023
@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner January 5, 2024 19:16
@alexbarghi-nv
Copy link
Member Author

/ok to test

@alexbarghi-nv
Copy link
Member Author

/ok to test

@alexbarghi-nv alexbarghi-nv marked this pull request as draft January 5, 2024 19:26
@alexbarghi-nv
Copy link
Member Author

/ok to test

@alexbarghi-nv
Copy link
Member Author

/ok to test

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested minor changes, mostly looks good.

the number of training epochs here. These are followed by the `REPLICATION_FACTOR` argument, which
can be used to create replications of the dataset for scale testing purposes.

The final two arguments are `FRAMEWORK` which can be either "cuGraphPyG" or "PyG", and `GPUS_PER_NODE`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we shall include "cuGraphDGL" here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the next PR

SCRIPTS_DIR=$4
NUM_EPOCHS=$5

SAMPLES_DIR=/samples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we are mounting this to the most performant path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is up to us to set LOGS_DIR, SAMPLES_DIR, and DATASETS_DIR in run_train_job.sh correctly. In the srun command, those are mounted to /logs, /samples, and /datasets in the container that this script runs in.

mg_utils/wait_for_workers.py Show resolved Hide resolved
Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alexbarghi-nv
Copy link
Member Author

/ok to test

@alexbarghi-nv
Copy link
Member Author

/ok to test

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. I don't feel too strongly, but I noticed several places in the shell scripts that assume things about the file system (/datasets, etc.). We usually put those scripts in another repo (the repo containing our machine-specific nightly scripts, etc.) and not the open cugraph repo.

@alexbarghi-nv
Copy link
Member Author

LGTM overall. I don't feel too strongly, but I noticed several places in the shell scripts that assume things about the file system (/datasets, etc.). We usually put those scripts in another repo (the repo containing our machine-specific nightly scripts, etc.) and not the open cugraph repo.

The /datasets and /scripts directories are mounted to directories the user has to provide. They default to the current working directory (it creates folders there and mounts them). So we're not exposing any NVIDIA internal info.

Copy link
Collaborator

@ChuckHastings ChuckHastings left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we need to update the copyright on a file that wasn't otherwise updated... but OK.

@alexbarghi-nv
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit c09db10 into rapidsai:branch-24.02 Jan 12, 2024
97 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 11, 2024
…formance Improvements (#4081)

Large-scale cuGraph-DGL performance testing scripts.  Also changes the DGL and PyG scripts to evaluate on all ranks and reuse the test samples, and adds support for benchmarking cuGraph-DGL/cuGraph-PyG with WholeGraph.

Updates `cuGraph.gnn.FeatureStore` and `cuGraph-PyG` for increased performance:
* Supporting passing in a WG embedding directly to cugraph.gnn.FeatureStore
* Simplifying how cuGraph-PyG handles filtering and using a cache to prevent repeatedly copying data between the device and host
* Fix bug in cugraph.gnn.FeatureStore where indexing with a gpu tensor would raise an exception, especially with WG
* Add a function to cugraph.gnn.FeatureStore to check where data is stored, which is used by cuGraph-PyG to prevent unnecessary d2h and h2d copies

Merge after #3584

Authors:
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - Seunghwa Kang (https://github.com/seunghwak)
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Brad Rees (https://github.com/BradReesWork)

Approvers:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Don Acosta (https://github.com/acostadon)
  - Brad Rees (https://github.com/BradReesWork)
  - Naim (https://github.com/naimnv)
  - Joseph Nke (https://github.com/jnke2016)

URL: #4081
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Reference CuGraphSAGE Implementation to cuGraph-PyG
6 participants