
Tensorboard not displaying scalars #26

Closed
vassiasim opened this issue Dec 20, 2017 · 18 comments
Labels: bug, status: pending release (the fix has been merged but not yet released to PyPI)

Comments

@vassiasim

vassiasim commented Dec 20, 2017

When the flag run_tensorboard_locally is set to True, for example
estimator.fit(inputs, run_tensorboard_locally=True), where estimator = TensorFlow(..),
TensorBoard only displays the graph and projector, but not any scalars or images.

If one run is terminated and a new one is started by running again:
estimator.fit(inputs, run_tensorboard_locally=True)
then the scalars and images of the previous run are displayed on TensorBoard, but they are not updated as training continues.
It seems that when training is restarted, TensorBoard loads the previously saved logs from /tmp/<temp_folder>/, which was created by tempfile.mkdtemp(), but the new logs are then saved to a newly created folder.
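
For illustration, here is a minimal sketch of why a restarted run ends up watching stale logs, assuming (as described above) that the SDK creates the local log directory with tempfile.mkdtemp() on every fit() call:

import tempfile

# Each mkdtemp() call returns a brand-new directory, so if one is created per
# fit() call, the second run writes its event files to a folder that the
# already-running TensorBoard process is not watching.
first_logdir = tempfile.mkdtemp()   # e.g. /tmp/tmpab12cd
second_logdir = tempfile.mkdtemp()  # a different path, e.g. /tmp/tmpef34gh
print(first_logdir == second_logdir)  # False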

Is there any way to get TensorBoard working properly?
Would it make sense to add the ability to define the logdir for TensorBoard when calling TensorFlow?

@mvsusp
Contributor

mvsusp commented Dec 21, 2017

Hi @vassiasim,

I am investigating your issue. Would you mind sharing the OS platform and a code block that would allow me to reproduce the issue in the same way that you did?

Thanks for using SageMaker!

@vassiasim
Author

Hi @mvsusp,

Thank you for having a look!!

The example provided here gives me the exact same behaviour: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb

I am using AWS, so Ubuntu 16.04.

@mvsusp
Contributor

mvsusp commented Dec 21, 2017

Thanks @vassiasim. I will reproduce the issue and come back with further information.

@mvsusp
Contributor

mvsusp commented Dec 22, 2017

Hello @vassiasim,

I created a SageMaker workspace and executed the resnet cifar 10 jupyter notebook. The behavior when estimator.fit(inputs, run_tensorboard_locally=True) is called is almost exactly as you described. The only difference is that the scalars are displayed at the end of the training job, and shortly after that TensorBoard is deactivated.

The second call of estimator.fit(inputs, run_tensorboard_locally=True) will create a second training job, but it will use the same checkpoints from the previous execution. That is a useful feature and can be avoided by creating a new TensorFlow estimator instead of reusing the previous one. When the second call of estimator.fit(inputs, run_tensorboard_locally=True) starts, TensorBoard will pick up the state from the previous run, which is why you see the scalars there.

TensorBoard scalars are created after each evaluation. The notebook example that you used only evaluates once, at the end of training, which explains the current behavior.

I will change the example defined in tensorflow_resnet_cifar10_with_tensorboard.ipynb as follows:

import os

from sagemaker.tensorflow import TensorFlow

# `role` and `inputs` are defined earlier in the notebook.
source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')

estimator.fit(inputs, run_tensorboard_locally=True)

That will change the training job to evaluate more often, allowing you to see the scalars in the example. It will also make the training job slower, given that more checkpoints will be saved to the S3 bucket.

Please update your notebook with the code block above.

For more information on how the TensorFlow training process works: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/get_started/estimator.md#fit-the-dnnclassifier-to-the-iris-training-data-fit-dnnclassifier

For more information on the available hyperparameters in the SageMaker Python SDK for TensorFlow: https://github.com/aws/sagemaker-python-sdk#optional-hyperparameters

PR with the code changes: aws/amazon-sagemaker-examples#154

@vassiasim
Author

vassiasim commented Dec 23, 2017

Hi @mvsusp,

Thank you for the reply and PR!

That makes sense. I tried it and the scalars and images are displayed, but only when the job is complete, even though evaluation is run a few times before that.

Any thoughts?

Thanks a lot.

@mvsusp
Contributor

mvsusp commented Dec 27, 2017

Hi @vassiasim,

You can increase the number of training steps to get a longer training run and have more scalars and images displayed. Try training_steps=10000.

@vassiasim
Author

Hi @mvsusp,

Thank you for your reply. Yes, I agree that doing that will result in longer training, but I still don't understand why I don't get more points on the graph.

Based on your previous answer, adding min_eval_frequency runs the evaluation more often, rather than every 1000 steps, which is the default, so I would expect a point on TensorBoard every time the evaluation is run. I am just trying to understand when TensorBoard updates the displayed values for training and for evaluation.

Using the notebook with training_steps=10000 and hyperparameters={'min_eval_frequency': 10}, I get points for training up to step 600 and one point for evaluation at 1 after passing step=1000. I would then expect TensorBoard to get updated after passing step=2000, but it doesn't. It only got updated at some seemingly random point after step=3000, and again only for training (still just one point for evaluation at 1). It doesn't update again until the end of the training at 10000 steps.

I am just trying to understand how it works exactly and also to be able to see the progress of training.

Thank you!

@jbencook

I'm digging in on this issue right now since we need TensorBoard to work in order to make use of SageMaker. It seems like TensorBoard only ever sees the first event that gets written to the .tfevents file and then it stops updating. Steps to reproduce the issue:

  1. Run the CIFAR10 example
  2. After the job starts training, refresh the TensorBoard page. At this point, TensorBoard shows me one point on the accuracy plot under the scalars tab.
  3. Keep an eye on the .tfevents file in the log directory by running ls -la every 30 seconds or so. When it updates (you can tell because the timestamp changes), start a new TensorBoard on a different port pointing to the same log directory. You can see that the new TensorBoard registers the new event, while the existing TensorBoard still ignores it. I've got a screenshot with side-by-side TensorBoards below. Both sessions are pointing to the same log directory, but running on different ports.

[Screenshot, 2018-01-19 12:36 PM: two TensorBoard sessions side by side, pointing to the same log directory on different ports]

I'm still not 100% sure what the cause is, but I have found a couple of things. First, someone else was having a very similar problem when syncing logs from Google Cloud Storage to a local directory. They filed an issue in the TensorBoard repo here. When I run the hack they suggest (rsync --inplace), the logs seem to mostly update correctly.

But I also found another problem. When I start TensorBoard separately, I can see what it logs, and at one point I got the following error:

E0122 08:56:36.886072 Reloader directory_watcher.py:241] File /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws updated even though the current file is /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws.8af0e520

As soon as TensorBoard sees a file that's lexicographically higher than the event file it's supposed to be watching, it moves on to the new one and never looks at the original again. The aws s3 sync command creates temporary files that are lexicographically higher, so as soon as TensorBoard sees those, it stops watching the correct tfevents file.
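
For concreteness, here is a tiny Python sketch using the two file names from the log line above:

# The temp file that aws s3 sync creates sorts lexicographically after the
# real event file, which is enough to make TensorBoard's directory watcher
# move on to it and never look back at the original.
real_file = "events.out.tfevents.1516632057.aws"
temp_file = "events.out.tfevents.1516632057.aws.8af0e520"
print(temp_file > real_file)           # True
print(sorted([real_file, temp_file]))  # the temp file sorts last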

I hope this helps. Unless I can figure out a more elegant solution in the next couple of hours, my plan is to implement a hack where I keep two copies of the logs directory: one for aws s3 sync to update and the other for TensorBoard to watch. That will work fine for us, but a proper solution will probably be more involved.

@jbencook

Update: it looks like files don't actually have to be updated in place for TensorBoard to pick them up. That would narrow it down to the temporary files that aws s3 sync creates.

I need to do some testing with a non-toy example, but I have a quick fix that seems to be working for CIFAR 10 here.

If I'm right about the actual issue, then I think the best solution is probably to convince TensorBoard to expose its path filter so it can be used in the CLI. In that case SageMaker could just tell TensorBoard to ignore files with anything after .aws.

Another option would be to sync files from S3 without creating temporary files in the same directory. That would be easy enough if you can download the whole .tfevents file in one call, but I'm guessing those can get big.

So far I've gone with the easiest solution, which is to keep two copies of the log directory locally. I'm happy to help out with a better fix if someone can comment on a preferred approach.
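
Roughly, the idea looks like this (a Python 3 sketch only; the directory names are placeholders and the event-file pattern is an assumption based on the naming seen in the log above):

import os
import re
import shutil

# SYNC_DIR is where `aws s3 sync` writes (including its temp files);
# WATCH_DIR is a separate directory handed to TensorBoard as its logdir.
SYNC_DIR = "/tmp/sagemaker-sync"
WATCH_DIR = "/tmp/sagemaker-tensorboard"

# Match "events.out.tfevents.<timestamp>.<host>" with no trailing suffix,
# so the lexicographically higher sync temp files are skipped.
EVENT_FILE_RE = re.compile(r"events\.out\.tfevents\.\d+\.[^.]+$")

def mirror_event_files():
    os.makedirs(WATCH_DIR, exist_ok=True)
    for name in os.listdir(SYNC_DIR):
        if EVENT_FILE_RE.search(name):
            shutil.copy2(os.path.join(SYNC_DIR, name),
                         os.path.join(WATCH_DIR, name))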

I'll also follow up if I find out my current solution is not working.

@jpbarto

jpbarto commented Feb 17, 2018

All, any update on this issue?

@lukmis
Contributor

lukmis commented Feb 20, 2018

I've run the example mentioned in this thread. I made sure more data would be produced by setting 'save_summary_secs' and 'save_checkpoints_secs' to just a few seconds.
I noticed that the originally launched TensorBoard was indeed not refreshing, but starting a separate TensorBoard on the same local directory was showing data. I also ran 'tensorboard --inspect' to confirm the data was there.
I also realized that the example code doesn't do any special handling of 'tf.summary.FileWriter', which made me try running with only 1 instance.

Running with 1 training instance works correctly, and TensorBoard is refreshed as training goes on.

@hsakkout

Just checking in to see if there are any updates or any indication of when this will be fixed.

@winstonaws
Contributor

@hsakkout Right now we're prioritizing this against a host of TensorFlow Serving-related bugs that have been discovered recently, e.g. #99.

Expect an update by EOD Wednesday about where we landed.

@winstonaws
Contributor

winstonaws commented Mar 21, 2018

With further investigation, I confirmed there's a problem even for single-machine training. (The problem @lukmis mentioned may be a separate one we have to investigate).

I believe the problem is the same one described here: tensorflow/tensorboard#349

We use aws s3 sync (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L100), which would cause the same problem as the gsutil sync.

I'll try out @jbencook's fix next.

@winstonaws
Contributor

@jbencook I tried out your fix and it's working. I think the overall approach is the right one - the main difference I'd suggest is simply to use a context manager to clean up the temporary directory after its contents are copied to the TensorFlow log dir.
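
Something along those lines, perhaps (a sketch only, assuming Python 3's tempfile.TemporaryDirectory; the bucket, prefix, and logdir paths are placeholders):

import os
import shutil
import subprocess
import tempfile

logdir = "/tmp/tensorboard-logdir"  # the directory TensorBoard is watching

with tempfile.TemporaryDirectory() as sync_dir:
    # Sync into a throwaway directory so TensorBoard never sees the in-flight
    # temp files that `aws s3 sync` creates.
    subprocess.check_call(["aws", "s3", "sync", "s3://my-bucket/checkpoints", sync_dir])
    # Once the sync has finished, copy the completed files into the watched logdir.
    os.makedirs(logdir, exist_ok=True)
    for name in os.listdir(sync_dir):
        src = os.path.join(sync_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(logdir, name))
# The context manager removes sync_dir and its contents automatically here.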

Would you be interested in submitting a PR?

@jbencook

Good point - I'm not currently cleaning up the intermediate directory. I can fix that up and submit a PR later today.

@ChoiByungWook added the status: pending release label on Mar 22, 2018
@ChoiByungWook
Contributor

@jbencook Thank you so much for your contribution!

Until the next SDK release, the TensorBoard fix can be obtained by building and installing from master.

It is also possible to use the fix within a SageMaker notebook instance by building and installing from source.

  1. Start a new conda_tensorflow_p27 notebook
  2. Clone from master and pip install within the cell:
! git clone https://github.com/aws/sagemaker-python-sdk.git python-sdk-tensorboard-fix && cd python-sdk-tensorboard-fix && pip install . --upgrade
  3. Run the cell

All TensorFlow jobs that run TensorBoard should now correctly display scalars!

Feel free to run the sample TensorBoard notebook, tensorflow_resnet_cifar10_with_tensorboard, which is in /sample-notebooks/sagemaker-python-sdk.

Thanks again!

@ChoiByungWook
Contributor

Hello,

@jbencook Thank you so much for your contribution.

This fix has been released.

laurenyu pushed a commit to laurenyu/sagemaker-python-sdk that referenced this issue May 31, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018