Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TensorBoard doesn't load new data when summaries are stored in GCS. #867

Closed
mrfortynine opened this issue Jan 10, 2018 · 32 comments
Closed

Comments

@mrfortynine
Copy link

I have been storing summaries in GCS bucket, and running TensorBoard on a Compute Engine instance. This set up worked fairly well. Coming back from holiday break, I noticed that TensorBoard no longer reload new summary events. The only way for me to see new data is to restart TensorBoard process.

@nfelt
Copy link
Contributor

nfelt commented Jan 10, 2018

Do you happen to have the TensorBoard logs stored anywhere? And is the problem reproducible or did it just happen this once after waiting a long time?

@mrfortynine
Copy link
Author

I have been experiencing this for the last two hours and it's still on going . Where can I get TensorBoard log?

@nfelt
Copy link
Contributor

nfelt commented Jan 10, 2018

TensorBoard should log to standard out by default, so if you started it manually on Compute Engine it would just be logging to the console. If it's more customized it might be logging to /var/log or somewhere else, but I couldn't say for sure without knowing more about your specific setup.

How big is your logdir - is it fairly large? Sometimes very large or deeply nested logdirs can take so long to scan through that TensorBoard essentially appears to be frozen after it loads data for the first time.

@mrfortynine
Copy link
Author

The logdir is not large at all. The deepest level is two. I have no more than 200 files in total since I just started the training. The event file is about 80MB at this moment.

I used default TensorBoard installation on Compute Engine, nothing customized. I don't see any interesting logs other than this in console:

jia_pu@jia-tensorboard:~$ tensorboard --logdir gs://cruise-jia/ssd_training_samples_large
TensorBoard 0.4.0rc3 at http://jia-tensorboard:6006 (Press CTRL+C to quit)

And this is all I have in /var/log:

jia_pu@jia-tensorboard:~$ ls -l /var/log/
total 1576
-rw-r--r-- 1 root      root      0 Jan  1 06:46 alternatives.log
-rw-r--r-- 1 root      root   3134 Dec 19 21:54 alternatives.log.1
-rw-r----- 1 root      adm       0 Dec 25 06:42 apport.log
-rw-r----- 1 root      adm     393 Dec 25 04:28 apport.log.1
drwxr-xr-x 2 root      root   4096 Jan  1 06:46 apt
-rw-r----- 1 syslog    adm  149791 Jan 10 23:39 auth.log
-rw-r----- 1 syslog    adm  249024 Jan  7 06:47 auth.log.1
-rw-r----- 1 syslog    adm   25961 Dec 31 06:47 auth.log.2.gz
-rw-r----- 1 syslog    adm   19091 Dec 24 06:25 auth.log.3.gz
-rw-r--r-- 1 root      root  58424 Dec 19 21:46 boot.log
-rw-rw---- 1 root      utmp 263808 Jan 10 23:39 btmp
-rw-rw---- 1 root      utmp 273024 Jan  1 06:34 btmp.1
-rw-r--r-- 1 syslog    adm  122970 Dec 19 21:46 cloud-init.log
-rw-r--r-- 1 root      root   3833 Dec 19 21:46 cloud-init-output.log
drwxr-xr-x 2 root      root   4096 Nov 30  2016 dist-upgrade
-rw-r--r-- 1 root      adm   28531 Dec 19 21:46 dmesg
-rw-r--r-- 1 root      root      0 Dec 19 21:46 dmesg.0
-rw-r--r-- 1 root      root   3477 Jan 10 06:37 dpkg.log
-rw-r--r-- 1 root      root  65475 Dec 20 06:30 dpkg.log.1
drwxr-xr-x 2 root      root   4096 Dec  8 17:57 fsck
-rw-r----- 1 syslog    adm       0 Dec 31 06:49 kern.log
-rw-r----- 1 syslog    adm   14192 Dec 25 04:28 kern.log.1
-rw-r----- 1 syslog    adm    9926 Dec 20 06:29 kern.log.2.gz
drwxr-xr-x 2 landscape root   4096 Dec 19 21:46 landscape
-rw-rw-r-- 1 root      utmp 292584 Jan 10 21:30 lastlog
drwxr-xr-x 2 ntp       ntp    4096 Jul 12 12:23 ntpstats
-rw-r----- 1 syslog    adm    2141 Jan 10 23:17 syslog
-rw-r----- 1 syslog    adm    3307 Jan 10 06:25 syslog.1
-rw-r----- 1 syslog    adm     517 Jan  9 06:25 syslog.2.gz
-rw-r----- 1 syslog    adm     490 Jan  8 06:25 syslog.3.gz
-rw-r----- 1 syslog    adm     531 Jan  7 06:47 syslog.4.gz
-rw-r----- 1 syslog    adm     490 Jan  6 06:25 syslog.5.gz
-rw-r----- 1 syslog    adm     479 Jan  5 06:25 syslog.6.gz
-rw-r----- 1 syslog    adm     497 Jan  4 06:25 syslog.7.gz
-rw-r--r-- 1 root      root 179834 Dec 19 21:46 udev
drwxr-xr-x 2 root      root   4096 Jan 10 06:35 unattended-upgrades
drwxr-xr-x 2 root      root   4096 Dec 21 06:44 upstart
-rw-rw-r-- 1 root      utmp   4992 Jan 10 21:30 wtmp
-rw-rw-r-- 1 root      utmp  13440 Dec 21 06:34 wtmp.1

@mrfortynine
Copy link
Author

Also, I should mention that this happens even when I run tensorboard on my local machine.

@jart
Copy link
Contributor

jart commented Jan 11, 2018

When TensorBoard reads individual event log files, it polls the end of the file in case records get appended. That behavior isn't possible on GCS since objects are sort of immutable and atomic.

Out of curiosity, when you store the newer summaries, do you put them in a second event log file, with a higher timestamp?

@mrfortynine
Copy link
Author

Here is the code I use to write summaries. Does this answer your question? It seems a new event file gets created only when current event file has grown to a certain size. Maybe tensorboard only updates when the second event file is created. However, my impression is that tensorboard updates much more frequently than what I'm seeing right now back in December last year.

@mrfortynine
Copy link
Author

I think the multiple events files are due to restarting training. Assuming that’s the case, I’m fairly certain that I have seen tensorboard correctly reloading new data from a single event file stored on GCS as recent as December last year.

@chihuahua
Copy link
Member

Are you training with multiple threads/processes? If so, could you ensure that only 1 thread is writing summaries to disk? Specifically, ensure that only 1 thread has is_chief set to true here?
https://github.com/tensorflow/models/blob/master/research/object_detection/trainer.py#L323

Also, that the master argument is correct?

TensorBoard fails to pick up new summaries if multiple FileWriters concurrently write to disk.

@mrfortynine
Copy link
Author

I can take a closer look at the the script. The object detection project comes with samples that write events to GCS. I'm just using their scripts. However I want to emphasize that this "used to work" as recent as last month. I haven't changed anything in training script. Is it possible that something in GCS has changed that causes this?

@chihuahua
Copy link
Member

Hmm, you noted that this happens even when running tensorboard on a local machine, right? Furthermore, when you restart TensorBoard, you do see the most recent events reflected, right? To me, that seems to hint at a problem on the data-reading side.

@mrfortynine
Copy link
Author

Yes to both of you questions. By "data-reading" side, do you mean tensorboard or even GCS SDK?

@chihuahua
Copy link
Member

Oh ... maybe. By data-reading, I mean TensorBoard logic that reads data from disk to render in the web app. GCS logic seems related, but it seems less of a culprit if we can restart TensorBoard and see updated data. That means we can successfully read from GCS.

@mrfortynine
Copy link
Author

The default TensorBoard version on Compute Engine hasn't changed between December and January this year, right?

@chihuahua
Copy link
Member

Could you please note how you are running the scripts? Which shell commands? Are several commands running at once?

@chihuahua
Copy link
Member

True. The bug in which TensorBoard does not pick up data from multiple event writers has existed for a long time though.

@mrfortynine
Copy link
Author

I will post the exact command line shortly. Here's the job description ML engine sees. I'm pretty certain that I only submitted the job once.

Training input	
{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_l_gpu",
  "workerType": "complex_model_l_gpu",
  "parameterServerType": "standard",
  "workerCount": "5",
  "parameterServerCount": "3",
  "packageUris": [
    "gs://cruise-jia/ssd_training_samples_large/train/packages/4f140252d755df5e9f1cc89580e41425745409ceff2fbf951b80c07eb74aeaf5/object_detection-0.1.tar.gz",
    "gs://cruise-jia/ssd_training_samples_large/train/packages/4f140252d755df5e9f1cc89580e41425745409ceff2fbf951b80c07eb74aeaf5/slim-0.1.tar.gz"
  ],
  "pythonModule": "object_detection.train",
  "args": [
    "--train_dir",
    "gs://cruise-jia/ssd_training_samples_large/train",
    "--pipeline_config_path",
    "gs://cruise-jia/ssd_training_samples_large/cloud.config"
  ],
  "region": "us-central1",
  "runtimeVersion": "1.2",
  "jobDir": "gs://cruise-jia/ssd_training_samples_large/train"
}

@chihuahua
Copy link
Member

chihuahua commented Jan 12, 2018

Thank you. One thing that I'm trying to figure out is where this train method is called.
https://github.com/tensorflow/models/blob/master/research/object_detection/trainer.py#L172

Could we try something? At this call to train:
https://github.com/tensorflow/models/blob/master/research/object_detection/trainer.py#L319

Could we pass it a summary_writer parameter? Its value would be:

if is_chief:
  summary_writer = tf.summary.FileWriter(train_dir, graph=tf.get_default_graph())
else:
  summary_writer = None

@mrfortynine
Copy link
Author

Thank you for looking into this. I will try it out. It may take me a couple of days as I will be traveling. Will report back.

@jart
Copy link
Contributor

jart commented Jan 16, 2018

Oh it's possible you might have accidentally installed tensorflow on pypi rather than tensorflow-tensorboard. It's a problem we're going to be solving soon. Try just pip install tb-nightly.

@jart jart closed this as completed Jan 16, 2018
@robieta
Copy link

robieta commented May 18, 2018

Is this likely to be resolved? I'm experiencing this failure mode (TensorBoard does not detect new scalar information on GCS) even with the latest tb-nightly.

@jart
Copy link
Contributor

jart commented May 18, 2018

@robieta We've been bug scrubbing this week to improve GCS support. TensorBoard nightly users can expect to see --logdir=gs://... performance and reliability greatly improve in the upcoming weeks.

@nfelt Could this be related to the truncation issue you encountered a few days ago in the TF GCS impl? If so, please reopen and assign.

@nfelt
Copy link
Contributor

nfelt commented May 19, 2018

@robieta There's a known issue where in some circumstances TensorBoard won't load fresh data from GCS. This would happen only in a case where you have 1 or 2 separate run subdirectories containing event files within your logdir (if you have 3 or more, you hit a different problem, but it shouldn't get stuck loading).

If that matches your situation and you're willing to use tf-nightly in addition to tb-nightly you can trying a workaround: set the environment variable GCS_READ_CACHE_MAX_SIZE_MB=0 when launching TensorBoard, e.g. GCS_READ_CACHE_MAX_SIZE_MB=0 tensorboard --logdir foo.

If that still doesn't help, could you add the environment variable TF_CPP_MIN_VLOG_LEVEL=2 and then provide the resulting log output?

@robieta
Copy link

robieta commented May 19, 2018

@nfelt Thanks for the response. I am using tf-nightly as well as tb-nightly.

Setting GCS_READ_CACHE_MAX_SIZE_MB=0 did indeed solve the issue. The specifics might be useful to you: my directory contains two events files (both in the root of the directory):

events.out.tfevents.1526692219.taylorrobie-sandbox-0
This is a static file containing graphs (and I think projectors) that is written at the start of an epoch and not really touched (at least for the timescales here), and it is ~8.5 mb.

events.out.tfevents.1526692287.n-461190de-w-0.host_call
This file contains two scalar summaries, and is updated every few seconds. It is ~3.5 mb at the time of writing.

The reason I think this might be useful to you is that GCS_READ_CACHE_MAX_SIZE_MB=3 works and GCS_READ_CACHE_MAX_SIZE_MB=4 does not update.

Let me know if you need anything else.

@nfelt
Copy link
Contributor

nfelt commented Jun 5, 2018

This should be fixed by #1226, so in the upcoming 1.9 release of TensorBoard, provided that you're using TensorFlow 1.9+ as well. See #1225 for details.

@nfelt nfelt closed this as completed Jun 5, 2018
@dichen-cd
Copy link

The problem still exists on Tensorboard 1.10.0.

The log files are generated on the remote server. I mounted the log path to local machine via cifs and opened a tensorboard server also on the local machine.

A stupid yet effecticve workaround is to restart tensorboard server every t seconds:

#!/bin/bash
trap ctrl_c INT
function ctrl_c() {
        echo 
	echo "Ctrl-C by user"
	kill -9 $pid
	exit
}
args=$@
while true
do
	eval "tensorboard $args > /dev/null 2>&1 &"
	pid=$!
	echo "tensorboard running on $pid"
	sleep 20
	kill -9 $pid
done

Name this file as tensorboard_daemon and put it at the same path as tensorboard.

@nfelt
Copy link
Contributor

nfelt commented Aug 22, 2018

@DeanChan I'm not sure how cifs works and I'm also not sure how you're using it with GCS - more detail would be helpful. But if you're not reading directly from GCS itself, e.g. with tensorboard --logdir gs://bucket/logdir, then the issue fixed here is a different issue.

It seems more likely that the problem you're seeing is #349, a known issue where TensorBoard doesn't detect if event files it already has opened are replaced with files containing new data. If that's the case, the only fix right now is changing the syncing/mounting logic to ensure that the same files are used, e.g. for rsync specifying the --inplace option.

If you think it's not issue #349, please open a new GitHub issue for this problem specifically.

@dichen-cd
Copy link

@nfelt Many thanks for your reply.

I was not reading directly from GCS but reading the log files as they were on my local machine, i.e. tensorboard --logdir ./logs.

Currently I'm not sure if cifs supports inplace like rsync does. I'll give it a try.

@zippeurfou
Copy link

I think I am having the same issue with s3 files with multiple model.
It works well when I push the s3 files to my local machine but not on s3.

@KennethKJ
Copy link

Is this issue still monitored despite being closed? Do I need to open a new issue?

I have encountered the same issue. I am using Tensorflow 1.13 in a Google Datalab notebook run on a vm on the compute engine. I initialize TensorBord the following way:

from google.datalab.ml import TensorBoard
TensorBoard().start("gs://mypath_to_estimator_output_dir")

I am assuming that google.datalab.ml is calling the TensorBoard version installed with TensorFlow 1.13, right?

An estimator is run on AI platform via gcloud and it stores its summary etc in a directory on Google Storage specified as its output dir and I start tensorboard specifying this output dir. It works but it never refreshes and I have to start a new instance to see the newest steps.

Thanks in advance for any help

@nfelt
Copy link
Contributor

nfelt commented May 15, 2019

@KennethKJ Do you have multiple "tfevents" summary files generated in a single directory by your estimator? If so you may be hitting #1063 where only the last summary file to be created will be polled for new data after the initial pass over the files.

If that doesn't sound like it, then please open a new issue.

@KennethKJ
Copy link

Thanks a lot for your quick reply! Yes, looks like that is indeed the case. It seems that the default behavior of the Estimator is to store multiple tfevent files directly in the specified output directory. I will refer to issue #1063 instead and comment there if necessary for my use case. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants