[TF] Update TF v2.16.1 (without libfft) #9388
Conversation
A new Pull Request was created by @iarspider for branch IB/CMSSW_14_2_X/tf. @aandvalenzuela, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
cms-bot internal usage
@cmsbuild please test for CMSSW_14_2_TF_X/el8_aarch64_gcc12
@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12
@smuzaffar since the tests depend on TensorFlow (not Keras), the environment variable was not set. Should I add an explicit dependency on keras to PhysicsTools/TensorFlow? I don't think we can handle a circular dependency.
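For context, a hypothetical sketch of what such an explicit dependency could look like in the package's BuildFile.xml, assuming a keras tool were defined in cmsdist (the tool name here is an assumption, not something this PR adds):

<!-- Hypothetical: explicit keras tool dependency in PhysicsTools/TensorFlow/BuildFile.xml -->
<use name="tensorflow"/>
<use name="keras"/>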
@iarspider, can you try running the unit tests locally after setting the KERAS_BACKEND=tensorflow env?
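For reference, a minimal sketch of such a local run, assuming an already set-up CMSSW developer area (the paths and scram invocation are illustrative):

cd $CMSSW_BASE/src && cmsenv         # enter the developer area environment
export KERAS_BACKEND=tensorflow      # the variable under discussion in this PR
scram b runtests                     # run the unit tests of the checked-out packages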
Pull request #9388 was updated.
@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12
@@ -3,6 +3,7 @@
   <environment name="TENSORFLOW_BASE" default="@TOOL_ROOT@"/>
   <environment name="LIBDIR" default="$TENSORFLOW_BASE/lib"/>
   <environment name="INCLUDE" default="$TENSORFLOW_BASE/include"/>
+  <environment name="KERAS_BACKEND" default="tensorflow"/>
It should be a <runtime .../> type variable (see ROOTSYS as an example). Did you run the tests locally to see if the GPU unit tests pass after setting this?
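A sketch of the suggested toolfile change, modeled on how ROOTSYS is exported; the exact attributes are an assumption based on that pattern:

<!-- Hypothetical runtime-type variable, following the ROOTSYS pattern -->
<runtime name="KERAS_BACKEND" value="tensorflow"/>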
I tested it by setting the environment variable manually (not via the toolfile).
For me all the unit tests still fail with this error:

##Failure Location unknown## : Error
Test name: testHelloWorldCUDA::test
uncaught exception of type std::exception (or derived).
- An exception of category 'UnavailableAccelerator' occurred while
  [0] Calling tensorflow::setBackend()
Exception Message:
Cuda backend requested, NVIDIA GPU visible to cmssw, but not visible to TensorFlow in the job
Some tests that were failing previously worked after setting KERAS_BACKEND. Yes, I saw these failures as well; I thought I had missed some setup step to make them work (in a container started with the --nv flag).
E.g. testTFConstSession was failing with ValueError: Unable to import backend : theano, but after setting KERAS_BACKEND it passed.
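A quick way to confirm which backend Keras 3 picks up, since it reads KERAS_BACKEND at import time (a sketch):

KERAS_BACKEND=tensorflow python3 -c "import keras; print(keras.backend.backend())"
# expected to print: tensorflow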
Could the failure be due to 12.4 not being an officially tested CUDA version for TF 2.16.1 (or even 2.17)? The link lists 12.3 as the officially tested version.
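One way to cross-check which CUDA version a given TF build was compiled against is tf.sysconfig.get_build_info(), e.g.:

python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info().get('cuda_version'))"
# compare against the CUDA runtime installed locally (12.4 here)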
Running python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" prints this message:

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

and returns an empty list []. I googled this message, and there are basically three solutions:
- Run TensorFlow using the official docker image
- Install TensorFlow using conda and prebuilt wheels
- Force-connect the NUMA node (as suggested in the document that TensorFlow prints out), namely run sudo echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:06\:10.0/numa_node after each reboot (spelled out below). But that requires sudo rights (and, I would imagine, not in the container, but on the host).
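Spelled out, that third workaround might look like this on the host (the PCI address 0000:06:10.0 is taken from the example above and will differ per machine):

cat /sys/bus/pci/devices/0000:06:10.0/numa_node     # prints -1 on affected machines
echo 0 | sudo tee /sys/bus/pci/devices/0000:06:10.0/numa_node   # repeat after each reboot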
Are you sure you started cmssw-el8 with the --nv option? For me the following command
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
runs fine (both for this PR and the TF_X IBs) and returns
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Now it works for me as well, weird.
please abort
Pull request #9388 was updated.
@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12
-1
Failed Tests: UnitTests GpuUnitTests
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Unit Tests: I found 1 error in the following unit tests:
---> test testSiStripPayloadInspector had ERRORS
GPU Unit Tests: I found 9 errors in the following unit tests:
---> test testBrokenLineFitGPU_t had ERRORS
---> test testFitsGPU_t had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
and more ...
Comparison Summary:
Summary:
ChatGPT suggests using:
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/common_runtime/device_factory.h"
#include <iostream>
#include <vector>  // needed for std::vector<tensorflow::DeviceAttributes> below
int main() {
// Initialize a session
tensorflow::Session* session;
tensorflow::SessionOptions options;
// Try to create a new session
tensorflow::Status status = tensorflow::NewSession(options, &session);
if (!status.ok()) {
std::cerr << "Error creating session: " << status.ToString() << std::endl;
return -1;
}
// Retrieve the list of available devices
std::vector<tensorflow::DeviceAttributes> devices;
status = session->ListDevices(&devices);
if (!status.ok()) {
std::cerr << "Error listing devices: " << status.ToString() << std::endl;
return -1;
}
// Check if any GPU devices are available
bool gpu_available = false;
for (const auto& device : devices) {
std::cout << "Device name: " << device.name() << ", type: " << device.device_type() << std::endl;
if (device.device_type() == "GPU") {
gpu_available = true;
}
}
if (gpu_available) {
std::cout << "GPU is available and can be used." << std::endl;
} else {
std::cout << "No GPU devices are available." << std::endl;
}
// Clean up
session->Close();
delete session;
return 0;
}
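For completeness, a hypothetical way to build and run this check inside the environment set up by the toolfile above, using its TENSORFLOW_BASE variable (the source file name and the exact library names are assumptions and may differ per build):

g++ -std=c++17 check_gpu.cc -I"$TENSORFLOW_BASE/include" \
    -L"$TENSORFLOW_BASE/lib" -ltensorflow_cc -ltensorflow_framework -o check_gpu
./check_gpu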
Alternative version of #9241