
Tensorboard not displaying scalars #26

Closed
vassiasim opened this issue Dec 20, 2017 · 18 comments
Labels: bug, status: pending release (the fix has been merged but not yet released to PyPI)

Comments

@vassiasim

vassiasim commented Dec 20, 2017

When the flag run_tensorboard_locally is set to True, for example
estimator.fit(inputs, run_tensorboard_locally=True), where estimator = TensorFlow(..),
TensorBoard only displays the graph and projector, but not any scalars or images.

If one run is terminated and a new one is started by running again:
estimator.fit(inputs, run_tensorboard_locally=True)
then the scalars and images of the previous run are displayed on TensorBoard, but they are not updated as training continues.
It seems that when training is restarted, TensorBoard loads the previously saved logs from /tmp/<temp_folder>/, which was created by tempfile.mkdtemp(), but the new logs are then saved to a newly created folder.
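
For illustration, here is a minimal sketch of why a restarted run ends up watching stale logs, assuming (as described above) that the SDK creates the local log directory with tempfile.mkdtemp() on every fit() call:

import tempfile

# Each mkdtemp() call returns a brand-new directory, so if one is created per
# fit() call, the second run writes its event files to a folder that the
# already-running TensorBoard process is not watching.
first_logdir = tempfile.mkdtemp()   # e.g. /tmp/tmpab12cd
second_logdir = tempfile.mkdtemp()  # a different path, e.g. /tmp/tmpef34gh
print(first_logdir == second_logdir)  # False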

Is there any way to get TensorBoard working properly?
Would it make sense to add the ability to define the logdir for TensorBoard when calling TensorFlow?

@mvsusp
Contributor

mvsusp commented Dec 21, 2017

Hi @vassiasim,

I am investigating your issue. Would you mind sharing the OS platform and a code block that would allow me to reproduce the issue in the same way that you did?

Thanks for using SageMaker!

@vassiasim
Author

Hi @mvsusp,

Thank you for having a look!!

The example provided here gives me the exact same behaviour: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb

I am using AWS, so Ubuntu 16.04.

@mvsusp
Contributor

mvsusp commented Dec 21, 2017

Thanks @vassiasim. I will reproduce the issue and come back with further information.

@mvsusp
Contributor

mvsusp commented Dec 22, 2017

Hello @vassiasim,

I created a SageMaker workspace and executed the resnet cifar 10 jupyter notebook. The behavior when estimator.fit(inputs, run_tensorboard_locally=True) is called is almost exactly as you described. The only difference is that the scalars are displayed at the end of the training job, and shortly after that TensorBoard is deactivated.

The second call of estimator.fit(inputs, run_tensorboard_locally=True) will create a second training job, but it will use the same checkpoints from the previous execution. That is a useful feature and can be avoided by creating a new TensorFlow estimator instead of reusing the previous one. When the second call of estimator.fit(inputs, run_tensorboard_locally=True) starts, TensorBoard will pick up the state from the previous run, which is why you see the scalars there.

TensorBoard scalars are created after each evaluation. The notebook example that you used only evaluates once, at the end of training, which explains the current behavior.

I will change the example defined in tensorflow_resnet_cifar10_with_tensorboard.ipynb as follows:

import os

from sagemaker.tensorflow import TensorFlow

# `role` and `inputs` are defined earlier in the notebook.
source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')

estimator.fit(inputs, run_tensorboard_locally=True)

That will change the training job to evaluate more often, allowing you to see the scalars in the example. It will also make the training job slower, given that more checkpoints will be saved to the S3 bucket.

Please update your notebook with the code block above.

For more information on how the TensorFlow training process works: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/get_started/estimator.md#fit-the-dnnclassifier-to-the-iris-training-data-fit-dnnclassifier

For more information on the available hyperparameters in the SageMaker Python SDK for TensorFlow: https://github.com/aws/sagemaker-python-sdk#optional-hyperparameters

PR with the code changes: aws/amazon-sagemaker-examples#154

@vassiasim
Author

vassiasim commented Dec 23, 2017

Hi @mvsusp,

Thank you for the reply and PR!

That makes sense. I tried it and the scalars and images are displayed, but only when the job is complete, even though evaluation is run a few times before that.

Any thoughts?

Thanks a lot.

@mvsusp
Contributor

mvsusp commented Dec 27, 2017

Hi @vassiasim,

You can increase the number of training steps to get a longer training run and have more scalars and images displayed. Try training_steps=10000.

@vassiasim
Author

Hi @mvsusp,

Thank you for your reply. Yes, I agree that doing that will result in longer training, but I still don't understand why I don't get more points on the graph.

Based on your previous answer, adding min_eval_frequency runs the evaluation more often, rather than every 1000 steps, which is the default, so I would expect a point on TensorBoard every time the evaluation is run. I am just trying to understand when TensorBoard updates the displayed values for training and for evaluation.

Using the notebook with training_steps=10000 and hyperparameters={'min_eval_frequency': 10}, I get points for training up to step 600 and one point for evaluation at 1 after passing step=1000. I would then expect TensorBoard to get updated after passing step=2000, but it doesn't. It only got updated at some seemingly random point after step=3000, and again only for training (still just one point for evaluation at 1). It doesn't update again until the end of the training at 10000 steps.

I am just trying to understand how it works exactly and also to be able to see the progress of training.

Thank you!

@jbencook

I'm digging in on this issue right now since we need TensorBoard to work in order to make use of SageMaker. It seems like TensorBoard only ever sees the first event that gets written to the .tfevents file and then it stops updating. Steps to reproduce the issue:

  1. Run the CIFAR10 example
  2. After the job starts training, refresh the TensorBoard page. At this point, TensorBoard shows me one point on the accuracy plot under the scalars tab.
  3. Keep an eye on the .tfevents file in the log directory by running ls -la every 30 seconds or so. When it updates (you can tell because the timestamp changes), start a new TensorBoard on a different port pointing to the same log directory. You can see that the new TensorBoard registers the new event, while the existing TensorBoard still ignores it. I've got a screenshot with side-by-side TensorBoards below. Both sessions are pointing to the same log directory, but running on different ports.

[Screenshot, 2018-01-19 12:36 PM: two TensorBoard sessions side by side, pointing to the same log directory on different ports]

I'm still not 100% sure what the cause is, but I have found a couple of things. First, someone else was having a very similar problem when syncing logs from Google Cloud Storage to a local directory. They filed an issue in the TensorBoard repo here. When I run the hack they suggest (rsync --inplace), the logs seem to mostly update correctly.

But I also found another problem. When I start TensorBoard separately, I can see what it logs, and at one point I got the following error:

E0122 08:56:36.886072 Reloader directory_watcher.py:241] File /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws updated even though the current file is /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws.8af0e520

As soon as TensorBoard sees a file that's lexicographically higher than the event file it's supposed to be watching, it moves on to the new one and never looks at the original again. The aws s3 sync command creates temporary files that are lexicographically higher, so as soon as TensorBoard sees those, it stops watching the correct tfevents file.
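
For concreteness, here is a tiny Python sketch using the two file names from the log line above:

# The temp file that aws s3 sync creates sorts lexicographically after the
# real event file, which is enough to make TensorBoard's directory watcher
# move on to it and never look back at the original.
real_file = "events.out.tfevents.1516632057.aws"
temp_file = "events.out.tfevents.1516632057.aws.8af0e520"
print(temp_file > real_file)           # True
print(sorted([real_file, temp_file]))  # the temp file sorts last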

I hope this helps. Unless I can figure out a more elegant solution in the next couple of hours, my plan is to implement a hack where I keep two copies of the logs directory: one for aws s3 sync to update and the other for TensorBoard to watch. That will work fine for us, but a proper solution will probably be more involved.

@jbencook

Update: it looks like files don't actually have to be updated in place for TensorBoard to pick them up. That would narrow it down to the temporary files that aws s3 sync creates.

I need to do some testing with a non-toy example, but I have a quick fix that seems to be working for CIFAR 10 here.

If I'm right about the actual issue, then I think the best solution is probably to convince TensorBoard to expose its path filter so it can be used in the CLI. In that case SageMaker could just tell TensorBoard to ignore files with anything after .aws.

Another option would be to sync files from S3 without creating temporary files in the same directory. That would be easy enough if you can download the whole .tfevents file in one call, but I'm guessing those can get big.

So far I've gone with the easiest solution, which is to keep two copies of the log directory locally. I'm happy to help out with a better fix if someone can comment on a preferred approach.
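
Roughly, the idea looks like this (a Python 3 sketch only; the directory names are placeholders and the event-file pattern is an assumption based on the naming seen in the log above):

import os
import re
import shutil

# SYNC_DIR is where `aws s3 sync` writes (including its temp files);
# WATCH_DIR is a separate directory handed to TensorBoard as its logdir.
SYNC_DIR = "/tmp/sagemaker-sync"
WATCH_DIR = "/tmp/sagemaker-tensorboard"

# Match "events.out.tfevents.<timestamp>.<host>" with no trailing suffix,
# so the lexicographically higher sync temp files are skipped.
EVENT_FILE_RE = re.compile(r"events\.out\.tfevents\.\d+\.[^.]+$")

def mirror_event_files():
    os.makedirs(WATCH_DIR, exist_ok=True)
    for name in os.listdir(SYNC_DIR):
        if EVENT_FILE_RE.search(name):
            shutil.copy2(os.path.join(SYNC_DIR, name),
                         os.path.join(WATCH_DIR, name))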

I'll also follow up if I find out my current solution is not working.

@jpbarto

jpbarto commented Feb 17, 2018

All, any update on this issue?

@lukmis
Contributor

lukmis commented Feb 20, 2018

I've run the example mentioned in this thread. I made sure more data would be produced by setting 'save_summary_secs' and 'save_checkpoints_secs' to just a few seconds.
I noticed that the originally launched TensorBoard was indeed not refreshing, but starting a separate TensorBoard on the same local directory was showing data. I also ran 'tensorboard --inspect' to confirm the data was there.
I also realized that the example code doesn't do any special handling of 'tf.summary.FileWriter', which made me try running with only 1 instance.

Running with 1 training instance works correctly, and TensorBoard is refreshed as training goes on.

@hsakkout

Just checking in to see if there are any updates or any indication of when this will be fixed.

@winstonaws
Contributor

@hsakkout Right now we're prioritizing this against a host of TensorFlow Serving-related bugs that have been discovered recently, e.g. #99.

Expect an update by EOD Wednesday about where we landed.

@winstonaws
Contributor

winstonaws commented Mar 21, 2018

With further investigation, I confirmed there's a problem even for single-machine training. (The problem @lukmis mentioned may be a separate one we have to investigate).

I believe the problem is the same one described here: tensorflow/tensorboard#349

We use aws s3 sync (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L100), which would cause the same problem as the gsutil sync.

I'll try out @jbencook's fix next.

@winstonaws
Contributor

@jbencook I tried out your fix and it's working. I think the overall approach is the right one - the main difference I'd suggest is simply to use a context manager to clean up the temporary directory after its contents are copied to the TensorFlow log dir.
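
Something along those lines, perhaps (a sketch only, assuming Python 3's tempfile.TemporaryDirectory; the bucket, prefix, and logdir paths are placeholders):

import os
import shutil
import subprocess
import tempfile

logdir = "/tmp/tensorboard-logdir"  # the directory TensorBoard is watching

with tempfile.TemporaryDirectory() as sync_dir:
    # Sync into a throwaway directory so TensorBoard never sees the in-flight
    # temp files that `aws s3 sync` creates.
    subprocess.check_call(["aws", "s3", "sync", "s3://my-bucket/checkpoints", sync_dir])
    # Once the sync has finished, copy the completed files into the watched logdir.
    os.makedirs(logdir, exist_ok=True)
    for name in os.listdir(sync_dir):
        src = os.path.join(sync_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(logdir, name))
# The context manager removes sync_dir and its contents automatically here.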

Would you be interested in submitting a PR?

@jbencook

Good point - I'm not currently cleaning up the intermediate directory. I can fix that up and submit a PR later today.

@ChoiByungWook added the status: pending release label on Mar 22, 2018
@ChoiByungWook
Contributor

@jbencook Thank you so much for your contribution!

Until the next SDK release, the TensorBoard fix can be obtained by building and installing from master.

It is also possible to use the fix within a SageMaker notebook instance by building and installing from source.

  1. Start a new conda_tensorflow_p27 notebook
  2. Clone from master and pip install within the cell:
! git clone https://github.com/aws/sagemaker-python-sdk.git python-sdk-tensorboard-fix && cd python-sdk-tensorboard-fix && pip install . --upgrade
  3. Run the cell

All TensorFlow jobs that run TensorBoard should now correctly display scalars!

Feel free to run the sample TensorBoard notebook, tensorflow_resnet_cifar10_with_tensorboard, which is in /sample-notebooks/sagemaker-python-sdk.

Thanks again!

@ChoiByungWook
Contributor

Hello,

@jbencook Thank you so much for your contribution.

This fix has been released.

laurenyu pushed a commit to laurenyu/sagemaker-python-sdk that referenced this issue May 31, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018