
Getting error "Problem with output in nvidia-smi pmon -c 10" #67

Open
guillaumeramey opened this issue Jun 7, 2021 · 3 comments

@guillaumeramey

Hi, we're getting this error in the log file:

experiment_impact_tracker.compute_tracker.ImpactTracker - ERROR - Encountered exception within power monitor thread!
experiment_impact_tracker.compute_tracker.ImpactTracker - ERROR -   File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/compute_tracker.py", line 105, in launch_power_monitor
    _sample_and_log_power(log_dir, initial_info, logger=logger)
  File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/compute_tracker.py", line 69, in _sample_and_log_power
    results = header["routing"]["function"](process_ids, logger=logger, region=initial_info['region']['id'], log_dir=log_dir)
  File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/gpu/nvidia.py", line 117, in get_nvidia_gpu_power
    raise ValueError('Problem with output in nvidia-smi pmon -c 10')

Is it an issue with our Nvidia GPU? We are using a Tesla T4.

@Breakend
Owner

Breakend commented Jun 8, 2021

Could you let us know what output you get if you run this from the command line on the machine you're using? This will help narrow down the source of the error.

$ nvidia-smi pmon -c 10

@guillaumeramey
Author

I am using Google Colab, so it's not always the same GPU.
I ran subprocess.getoutput('nvidia-smi pmon -c 10'), but it returned no process data:

# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              

With subprocess.getoutput('nvidia-smi') I obtained this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
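For reference, the pmon output above contains only dash rows, meaning no process was attributed to the GPU in any of the ten samples, which is presumably what the parser in gpu/nvidia.py rejects before raising the ValueError. Below is a minimal standalone check, illustrative only and not code from experiment_impact_tracker (the helper name is made up), that counts how many pmon rows actually carry a PID:

import subprocess

def pmon_process_rows(count=10):
    """Return pmon lines that report a real process (non-dash PID column)."""
    out = subprocess.getoutput(f"nvidia-smi pmon -c {count}")
    rows = []
    for line in out.splitlines():
        if line.startswith("#"):
            # Skip the two header lines ("# gpu pid type ..." and "# Idx # C/G ...")
            continue
        fields = line.split()
        # Column order: gpu, pid, type, sm, mem, enc, dec, command
        if len(fields) >= 2 and fields[1] != "-":
            rows.append(line)
    return rows

if __name__ == "__main__":
    rows = pmon_process_rows()
    print(f"{len(rows)} process row(s) reported")  # prints 0 for the Colab output above

On the Colab K80 shown above this returns zero rows, which is consistent with the error raised in get_nvidia_gpu_power.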

@Breakend
Owner

Hi, unfortunately Colab isn't fully supported right now because it doesn't always expose the hardware endpoints required to calculate energy use. We are working on solutions and will follow up if we have something that works.
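
As a stopgap, one option is to probe pmon before launching the tracker and skip GPU energy tracking when no per-process rows come back. The sketch below assumes the pmon_process_rows helper sketched earlier in this thread and the ImpactTracker(log_dir) / launch_impact_monitor() usage from the project README; the log directory name is hypothetical. Note this is a rough heuristic (pmon also shows only dashes when the GPU is simply idle) and it does not make tracking work on Colab, it only avoids the monitor thread crashing.

from experiment_impact_tracker.compute_tracker import ImpactTracker

# Probe once; on machines where pmon never attributes processes this stays empty.
if pmon_process_rows(count=1):
    tracker = ImpactTracker("impact_logs/")  # hypothetical log directory
    tracker.launch_impact_monitor()
else:
    print("nvidia-smi pmon reports no per-process data here; skipping GPU energy tracking.")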
