Tensorboard not displaying scalars #26
Comments
Hi @vassiasim, I am investigating your issue. Would you mind sharing the OS platform and a code block allowing me to reproduce the issue in the same way that you did? Thanks for using SageMaker!
Hi @mvsusp, Thank you for having a look!! The example provided here gives me the exact same behaviour: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb I am using …
Thanks @vassiasim. I will reproduce the issue and come back with further information.
Hello @vassiasim, I created a SageMaker workspace and executed the resnet cifar 10 jupyter notebook. The behavior when `run_tensorboard_locally=True` is set matches what you described: the second call of `estimator.fit(...)` only shows the scalars of the previous run. TensorBoard scalars are created after each evaluation, and the notebook example that you used only evaluates once at the end of the training, which explains the current behavior. I will change the example defined in tensorflow_resnet_cifar10_with_tensorboard.ipynb as follows:

```python
import os

from sagemaker.tensorflow import TensorFlow

# role and inputs are defined earlier in the notebook
source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')
estimator.fit(inputs, run_tensorboard_locally=True)
```

That will change the training job to evaluate more often, allowing you to see the scalars in the example. It will also make the training job slower, given that more checkpoints will be saved to the S3 bucket. Please change your notebook to use the code block above. For more information on the available hyperparameters for the TensorFlow SageMaker Python SDK, see https://github.com/aws/sagemaker-python-sdk#optional-hyperparameters. PR with the code changes: aws/amazon-sagemaker-examples#154
Hi @mvsusp, Thank you for the reply and PR! That makes sense. I tried it and get the … Any thoughts? Thanks a lot.
Hi @vassiasim, You can increase the number of training steps to have a longer training and more evaluations.
Hi @mvsusp, Thank you for your reply. Yes, I agree that doing that will result in a longer training, but I still don't understand why I don't get more points on the graph. Based on your previous answer, adding … Using the notebook with … I am just trying to understand how it works exactly and also to be able to see the progress of training. Thank you!
I'm digging in on this issue right now since we need TensorBoard to work in order to make use of SageMaker. It seems like TensorBoard only ever sees the first event that gets written to the log directory.
I'm still not 100% sure what the cause is, but I have found a couple of things. First, someone else was having a very similar problem when syncing logs from Google Cloud Storage to a local directory. They filed an issue in the TensorBoard repo here. When I run the hack they suggest (…).

But I also found another problem. When I start TensorBoard separately, I can see what it logs, and at one point I got the following error: …
As soon as TensorBoard sees a file that's lexicographically higher than the event file it's supposed to be watching, it moves on to the new one and never looks at the original again. The temporary files that the S3 sync drops into the log directory trigger exactly that. I hope this helps.

Unless I can figure out a more elegant solution in the next couple of hours, my plan is to implement a hack where I keep two copies of the logs directory: one for the S3 sync to write into, and one for TensorBoard to watch.
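A minimal sketch of that two-directory approach (the directory names, the placeholder S3 URI, and the polling loop are assumptions for illustration, not the code that eventually went into the SDK): `aws s3 sync` only ever writes into `sync_dir`, and by the time each sync call returns its partial downloads have been renamed to their final names, so only complete event files get copied into the directory TensorBoard watches.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path

S3_CHECKPOINTS = 's3://my-bucket/my-training-job/checkpoints'  # placeholder URI
sync_dir = Path(tempfile.mkdtemp())   # aws s3 sync writes here, temp files included
watch_dir = Path(tempfile.mkdtemp())  # point TensorBoard at this directory only

def mirror_logs(interval_secs=60):
    """Sync from S3, then copy the completed files into the watched directory."""
    while True:
        # When this call returns, partially downloaded temp files have been
        # renamed to their final names, so sync_dir holds only complete files.
        subprocess.call(['aws', 's3', 'sync', S3_CHECKPOINTS, str(sync_dir)])
        for src in sync_dir.rglob('*'):
            if src.is_file():
                dest = watch_dir / src.relative_to(sync_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
        time.sleep(interval_secs)
```

TensorBoard would then be started with `--logdir` pointing at `watch_dir`, so it never sees the oddly named files the sync creates.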
Update: it looks like files don't actually have to be updated in place for TensorBoard to pick them up. That would narrow it down to the temporary files that the sync creates. I need to do some testing with a non-toy example, but I have a quick fix that seems to be working for CIFAR 10 here.

If I'm right about the actual issue, then I think the best solution is probably to convince TensorBoard to expose their path filter so it can be used from the CLI. In that case SageMaker could just tell TensorBoard to ignore files with anything appended after the usual event-file name. Another option would be to sync files from S3 without creating temporary files in the same directory. That would be easy enough if you can download the whole log directory each time.

So far I've got the easiest solution, which is to keep two copies of the log directory locally. I'm happy to help out with a better fix if someone can comment on a preferred approach. I'll also follow up if I find out my current solution is not working.
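For the "sync without temporary files in the same directory" option, a rough sketch (the bucket, prefix, and staging layout are assumptions, not SDK behavior) could download each object into a staging area and then move it into the log directory with a single rename, so nothing with an unexpected name ever appears where TensorBoard is looking:

```python
import os
import tempfile

import boto3

def pull_event_files(bucket, prefix, logdir):
    """Download S3 objects into a staging dir, then rename them into logdir."""
    s3 = boto3.client('s3')
    # Stage next to logdir so os.replace stays on the same filesystem.
    staging = os.path.join(os.path.dirname(os.path.abspath(logdir)), '.staging')
    os.makedirs(staging, exist_ok=True)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            rel = os.path.relpath(obj['Key'], prefix)
            dest = os.path.join(logdir, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            fd, tmp = tempfile.mkstemp(dir=staging)
            os.close(fd)
            s3.download_file(bucket, obj['Key'], tmp)
            os.replace(tmp, dest)  # one rename; no stray names appear in logdir
```

Unlike `aws s3 sync`, this re-downloads every object on each pass, so it trades efficiency for simplicity.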
All, any update on this issue?
I've run the example mentioned in this thread. I made sure more data would be produced by setting 'save_summary_secs' and 'save_checkpoints_secs' to just a few seconds. Running with 1 training instance works correctly, and TensorBoard is refreshed as training goes on.
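For reference, a sketch of the single-instance run described above, with the two settings passed as hyperparameters (the names are taken from that comment and the values are only illustrative; `source_dir`, `role`, and `inputs` are assumed to be defined as in the earlier example):

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'save_summary_secs': 10,
                                        'save_checkpoints_secs': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=1, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')
estimator.fit(inputs, run_tensorboard_locally=True)
```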
Just checking in to see if there are any updates or any indication of when this will be fixed.
With further investigation, I confirmed there's a problem even for single-machine training. (The problem @lukmis mentioned may be a separate one we have to investigate.) I believe the problem is the same one described here: tensorflow/tensorboard#349. We use `aws s3 sync` (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L100), which would cause the same problem as the gsutil sync. I'll try out @jbencook's fix next.
@jbencook I tried out your fix and it's working. I think the overall approach is the right one - the main difference I'd suggest is simply to use a context manager to clean up the temporary directory after its contents are copied to the TensorFlow log dir. Would you be interested in submitting a PR?
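A minimal sketch of that suggestion (the function name and arguments are assumptions, not the code from the eventual PR): keep the intermediate directory inside a `tempfile.TemporaryDirectory()` context manager so it is deleted as soon as its contents have been copied into the log dir.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def sync_into_logdir(s3_uri, logdir):
    """Sync from S3 into a throwaway directory, copy the results into logdir,
    and let the context manager delete the throwaway directory afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.call(['aws', 's3', 'sync', s3_uri, tmp])
        for src in Path(tmp).rglob('*'):
            if src.is_file():
                dest = Path(logdir) / src.relative_to(tmp)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
    # tmp (and any leftover partial files) no longer exists at this point
```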
Good point - I'm not currently cleaning up the intermediate directory. I can fix that up and submit a PR later today.
@jbencook Thank you so much for your contribution! Until the next SDK release, the TensorBoard fix can be used by building and installing the SDK from master. It is also possible to use the fix within a SageMaker notebook instance by building and installing from source.
All TensorFlow jobs that run TensorBoard should now correctly display scalars! Feel free to run the sample TensorBoard notebook, tensorflow_resnet_cifar10_with_tensorboard, which is in /sample-notebooks/sagemaker-python-sdk. Thanks again!
Hello @jbencook, thank you so much for your contribution. This fix has been released.
tensorboard example notebook
When the flag `run_tensorboard_locally` is set to `True`, for example `estimator.fit(inputs, run_tensorboard_locally=True)` where `estimator = TensorFlow(..)`, TensorBoard only displays the graph and projector but not any scalars or images.

If one run is terminated and a new one is started by running `estimator.fit(inputs, run_tensorboard_locally=True)` again, then the scalars and images of the previous run are displayed on TensorBoard, but they are not updated as training continues.

It seems that, when training is restarted, TensorBoard loads the previously saved logs from `/tmp/<temp_folder>/`, which was created by `tempfile.mkdtemp()`, but the new logs are then saved to a newly created folder. Is there any way to get TensorBoard working properly?
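(As a small illustration of the temp-folder behavior above, using only the standard library: every `tempfile.mkdtemp()` call creates a distinct directory, so two runs never share a log location unless a fixed `logdir` is supplied.)

```python
import tempfile

# Each call returns a brand-new directory, e.g. /tmp/tmpab12cd then /tmp/tmpef34gh.
logdir_first_run = tempfile.mkdtemp()
logdir_second_run = tempfile.mkdtemp()

# A TensorBoard process pointed at the first directory never sees event files
# written under the second one.
print(logdir_first_run, logdir_second_run, logdir_first_run != logdir_second_run)
```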
Would it make sense to add the ability to define `logdir` for TensorBoard when calling `TensorFlow`?