
FireStore listener as a service stop working after a day #2721

Closed
nicksav opened this issue Nov 23, 2018 · 44 comments

Assignees: jskeet
Labels: api: firestore (Issues related to the Firestore API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@nicksav

nicksav commented Nov 23, 2018

Version: Google.Cloud.Firestore-1.0.0-beta13

Guys, I am having some issues with Firestore real-time updates. I have deployed a web app to an Azure Web Site. On startup, this application subscribes to updates in the Firestore users collection.
        GoogleCredential googleCredential = GoogleCredential.FromJson(JsonConvert.SerializeObject(secret));
        var channel = new Grpc.Core.Channel(FirestoreClient.DefaultEndpoint.Host,
            googleCredential.ToChannelCredentials());

        FirestoreClient client = FirestoreClient.Create(channel);
        FirestoreDb db = FirestoreDb.Create(projectId: "myproject-2018", client: client);

        CollectionReference collection = db.Collection("users");

        // Subscribe to changes in the "users" collection.
        FirestoreChangeListener listener = collection.Listen(async querySnapshot =>
        {
            // Handle each query snapshot here.
        });

Everything works perfectly, but after a couple of hours, or sometimes days, it just stops receiving updates, without any errors. Is there any way for me to debug it to understand why?

@jskeet
Collaborator

jskeet commented Nov 23, 2018

When you say "without any errors" - what happens to the task exposed as listener.ListenerTask? Does that:

  • Complete
  • Stay uncompleted
  • Become faulted

? That would be the first piece of interesting information.

In terms of further diagnostics, there's nothing particularly built into the library to help here, unfortunately. You could turn on gRPC tracing, but that would generate a huge amount of logging.

Does everything else in the application continue to work? In particular, are you doing other Firestore operations on the same Channel (or even with the same FirestoreClient)?

If we could work out a way of reproducing this in an isolated way, that would be ideal. Do you think having one console application which had basically the code you've already shown, and another that just updates a single user entry once every (say) 15 seconds, would be a good start? I'd be happy to create that and just leave it running.
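
For reference, a minimal diagnostic sketch along those lines (assuming `listener` is the FirestoreChangeListener from the snippet above; the five-minute interval is arbitrary):

var listenerTask = listener.ListenerTask;
var statusTimer = new System.Threading.Timer(_ =>
{
    // Log whether the listener task has completed, stayed uncompleted, or faulted.
    Console.WriteLine($"{DateTime.UtcNow:O} listener task status: {listenerTask.Status}");
    if (listenerTask.IsFaulted)
    {
        Console.WriteLine(listenerTask.Exception);
    }
}, null, TimeSpan.Zero, TimeSpan.FromMinutes(5));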

@jskeet jskeet self-assigned this Nov 23, 2018
@jskeet jskeet added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. api: firestore Issues related to the Firestore API. labels Nov 23, 2018
@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 23, 2018
@jskeet jskeet added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Nov 27, 2018
@jskeet
Collaborator

jskeet commented Nov 27, 2018

I've now written some test code. Will leave it running and see what happens:

using Google.Cloud.Firestore;
using System;
using System.Linq;
using System.Threading.Tasks;

namespace Issue2721
{
    class Program
    {
        static void Main(string[] args)
        {
            Task updatingTask = UpdateAsync(args[0]);
            Task listeningTask = ListenAsync(args[0]);
            updatingTask.Wait();
            listeningTask.Wait();
        }

        static async Task UpdateAsync(string projectId)
        {
            var db = FirestoreDb.Create(projectId);
            var document = db.Collection("Issue2721").Document("testdoc");
            while (true)
            {
                var now = DateTime.UtcNow;
                await document.SetAsync(new { LastUpdated = now });
                Log($"Updated to {now:HH:mm:ss.ffffff}");
                await Task.Delay(TimeSpan.FromSeconds(15));
            }
        }

        static async Task ListenAsync(string projectId)
        {
            var db = FirestoreDb.Create(projectId);
            CollectionReference collection = db.Collection("Issue2721");

            FirestoreChangeListener listener = collection.Listen(snapshot =>
            {
                var doc = snapshot.Documents.First();
                var value = doc.GetValue<DateTime>("LastUpdated");
                Log($"Observed value {value:HH:mm:ss.ffffff}");
            });
            await listener.ListenerTask;
        }

        static void Log(string message) =>
            Console.WriteLine($"{DateTime.UtcNow:HH:mm:ss.ffffff}: {message}");
    }
}

@nicksav
Author

nicksav commented Dec 4, 2018

Hey jskeet. Was it still working OK after some time? I will deploy your code to an Azure Web Site and also leave it running for a couple of days. Will update you on Monday with the results.

@jskeet
Collaborator

jskeet commented Dec 4, 2018

Eek! I sincerely apologise - I dropped the ball on this one. I started it running on my laptop first, and of course that dropped the connection when the laptop went to sleep.

Will get it running on a desktop, writing to a file so I can see what's going on...

@jskeet
Collaborator

jskeet commented Dec 4, 2018

(It's been running for 4 hours now - no problems so far, but we'll see what happens over the next 24-48 hours...)

@jskeet
Collaborator

jskeet commented Dec 5, 2018

First test died after 18 hours due to the "updating" code failing with a ServiceUnavailable error. Rerunning - I'd really like to see the exact behavior you described.

@nicksav
Author

nicksav commented Dec 5, 2018

Yep, also deployed to Azure. Will provide a link tomorrow. So if it's running on Azure, how do I prevent the ServiceUnavailable error? Just re-attach the listener to continue listening for changes?

@DenSmoke

DenSmoke commented Dec 5, 2018

I had the same issue, but it disappeared by itself somehow. I mean the listeners stopped working, without an exception, after about one day.

@jskeet
Collaborator

jskeet commented Dec 5, 2018

@DenSmoke: I'm not quite sure I understand what you mean by "it disappeared by itself" - the issue disappeared, or the listener disappeared? If you still have listeners that stop reporting changes (but without the task failing), that's what we're trying to reproduce.

@nicksav: The listener should reattach itself on ServiceUnavailable errors (I think - it certainly does for some errors; I'd need to check exactly which). In my case it was the code making the change that failed with that error.

@DenSmoke

DenSmoke commented Dec 5, 2018

@jskeet I couldn't reproduce the issue again, so "the issue disappeared". It was on beta11.

@jskeet
Collaborator

jskeet commented Dec 6, 2018

The new test has been running for 25 hours with no problems so far. That doesn't mean it won't happen eventually though...

@DenSmoke

DenSmoke commented Dec 7, 2018

The problem happened again in a .NET Core 2.2 console app with Google.Cloud.Firestore beta14 on Ubuntu 18.04, running on a Compute Engine f1-micro instance.

@pauloevpr

pauloevpr commented Dec 7, 2018

I have a similar issue. I am running an ASP.NET Core 2.1 Web API on Google Cloud App Engine. The FirestoreListener stops working eventually. I can't say it is precisely after 24 hours; it seems just random.

From my analysis so far, it seems to be an issue with the Firestore backend rather than the Firestore .NET SDK. Firestore randomly fails when reading/writing documents; it is not just about the listeners.

For example, yesterday my ASP.NET Core Web API got the following exception when getting a document snapshot:

Grpc.Core.RpcException: Status(StatusCode=PermissionDenied, Detail="Missing or insufficient permissions.")

I have seen this at least once every month.

That is not the only one. I also have this one randomly once in a while when reading documents:

Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="The datastore operation timed out, or the data was temporarily unavailable.")

My conclusion is that, as a beta product, it is possible that the Firestore backend randomly fails to serve requests for whatever reason.

For the specific listener issue, my workaround is to have a function, triggered by a timer, that restores the listener as required.
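
For reference, a rough sketch of such a timer-based watchdog (all names here are illustrative, not from the post above; the createListener delegate stands for whatever code sets up the collection.Listen(...) subscription):

using System;
using System.Threading;
using Google.Cloud.Firestore;

public sealed class ListenerWatchdog : IDisposable
{
    private readonly Func<FirestoreChangeListener> _createListener;
    private readonly Timer _timer;
    private FirestoreChangeListener _listener;

    public ListenerWatchdog(Func<FirestoreChangeListener> createListener)
    {
        _createListener = createListener;
        _listener = createListener();
        // Check once a minute; the interval is arbitrary for this sketch.
        _timer = new Timer(_ => RestoreIfNeeded(), null, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));
    }

    private void RestoreIfNeeded()
    {
        // ListenerTask ends (completed, faulted or cancelled) once the listener has stopped,
        // so IsCompleted doubles as a "listener is dead" check.
        if (_listener.ListenerTask.IsCompleted)
        {
            _listener = _createListener();
        }
    }

    public void Dispose() => _timer.Dispose();
}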

@jskeet
Collaborator

jskeet commented Dec 7, 2018

Thanks for the updates everyone - and particularly for that extra detail, @pauloevpr.
My own test is still going strong after about 52 hours, so it isn't giving us any more information :(

I'll contact the Firestore internal team and point them at this issue. They may have some ideas about what's going on.

@jskeet
Collaborator

jskeet commented Dec 10, 2018

Internal feedback suggests this may be a gRPC issue rather than a server side issue. Obviously it being hard to reproduce doesn't help us :( (The test I started 5 days ago is still going strong...)

@pauloevpr: I've filed #2760 for the Unavailable issue. The PermissionDenied issue is odder. Have you seen it more than just the once? If you see it again, could you send me an email ([email protected]) with your project ID and a timestamp, and we'll look in the logs?

@nicksav
Author

nicksav commented Dec 14, 2018

OK, I left it running on Azure and again, it stopped working after some time.
[screenshot: log output]

@jskeet
Collaborator

jskeet commented Dec 14, 2018

Thanks - will check mine!

@jskeet
Collaborator

jskeet commented Dec 14, 2018

@nicksav: Thanks very much for the log - that's useful to see. Interestingly, my test also failed, but in a different way: the process was aborted.

I realized that the Main method wasn't going to tell me the listener task status, either. I've changed it to:

static void Main(string[] args)
{
    try
    {
        Task updatingTask = UpdateAsync(args[0]);
        Task listeningTask = ListenAsync(args[0]);
        listeningTask.Wait();
        Console.WriteLine("Listening task finished");
        updatingTask.Wait();
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }
}

I've restarted the test. I suspect eventually we'll need to capture gRPC detailed logs, but as lots of folks are now going on holiday, I suspect that'll be in 2019. Apologies that it's taking so long to work out what's going on here, but we'll get there eventually...

@DenSmoke

This issue with the listeners has started to happen suspiciously often; I had to restart our server 3 times today. Is there a good workaround to detect listener failure? I can restart the listeners at some interval, but this will consume operations quota.

@jskeet
Collaborator

jskeet commented Dec 14, 2018

My test failed today, but did so in a way which could be detected: the task returned by FirestoreChangeListener.ListenerTask was faulted:

System.AggregateException: One or more errors occurred. (Status(StatusCode=Unavailable, Detail="Transport closed")) ---> Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="Transport closed")
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__CloseStreamAsync|1>d.MoveNext() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 187
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__GetNextResponse|0>d.MoveNext() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 168
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.StartAsync() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 96
   at Google.Cloud.Firestore.WatchStream.StartAsync() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 133
   at Issue2721.Program.ListenAsync(String projectId) in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Issue2721/Program.cs:line 50
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Issue2721.Program.Main(String[] args) in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Issue2721/Program.cs:line 16

This is great news! It could be an explanation of what's going wrong:

  • The main loop of the listener is throwing an exception due to the stream dropping, which would normally be handled and restarted
  • While trying to close the stream, that throws an exception, which isn't caught

I can easily just swallow any exception that occurs while closing the stream, although I'd eventually want to have something a little more elegant than blindly swallowing it.

At this stage I'd rather not publish (on nuget.org) another full release before Christmas, but I'd be happy to create a "test package" for you to try, if you'd be willing to do so.

Even if you would rather not try that (and I'll test the fix either way myself) it suggests that if you just use the ListenerTask property, you should be able to detect the failure: if the task faults, the listener has died.
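
A detection-only sketch of that suggestion (restart logic omitted; `collection` is assumed to be a CollectionReference, and the usings are as in the test program above):

FirestoreChangeListener listener = collection.Listen(snapshot =>
{
    // Handle changes here.
});

// If the listener task faults, the listener has died - log it so the failure is visible.
_ = listener.ListenerTask.ContinueWith(
    t => Console.WriteLine($"Listener stopped: {t.Exception}"),
    TaskContinuationOptions.OnlyOnFaulted);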

@DenSmoke

Thanks for the explanation. I never awaited the ListenerTask result, so I didn't see that an exception was thrown. I can easily observe it now and restart the listener on failure.

@jskeet
Collaborator

jskeet commented Dec 14, 2018

@DenSmoke: Should I take it from that comment that you'd prefer not to test a speculative fix? (I completely understand if so - just making sure I've understood you correctly.)

I'm going to try that speculative fix now... (with a tiny bit of extra diagnostic code to observe when it's happened).

@DenSmoke

@jskeet Yes, you understood correctly. I don't have enough time to test the fix right now.

@jskeet
Collaborator

jskeet commented Dec 16, 2018

@nicksav: Thanks - I had another error over the weekend, so need to work a bit more on the fix, but I'll hopefully have an interim build to test some time tomorrow.

@jskeet
Collaborator

jskeet commented Dec 17, 2018

Update: I've had to change my test code to handle failures in the "update" side as well; I've also started logging the errors that cause the stream to be closed in the first place. It's possible that my fix did actually work, but that a lack of further updates was stopping it from showing anything more.
I've started it running again, and will see what that does. If it looks promising, I'll build a test package.

@jskeet
Collaborator

jskeet commented Dec 18, 2018

I saw a failure yesterday and this morning which was unrelated (and which I've asked about internally). I could merge my expected fix in and do another release today; I don't believe it's harmful, and it would let others test over Christmas.

jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Dec 18, 2018
This contains a slightly-speculative fix for issue googleapis#2721, where listeners may fail after a while.
jskeet added a commit that referenced this issue Dec 18, 2018
This contains a slightly-speculative fix for issue #2721, where listeners may fail after a while.
@jskeet
Collaborator

jskeet commented Dec 18, 2018

I've now released 1.0.0-beta15. I won't close this issue just yet, but your feedback would be much appreciated. (The only change compared with 1.0.0-beta14 is swallowing exceptions when closing the listener stream.)

Note that if you get an exception about "Server removed watch target" that's a separate issue - I've seen it twice so far today, but never before.

@jskeet
Collaborator

jskeet commented Dec 18, 2018

Note that until #2776 is fixed (and even after that, really), it's still worth watching the ListenerTask to detect failures.
I'll be looking at 2776 in the New Year.

@nicksav
Author

nicksav commented Dec 28, 2018

Hi guys,
Can anybody provide a good code example of how to correctly handle exceptions and restart the listener?

@DenSmoke

@nicksav The simplest way:

private async void ListenerRestarter(FirestoreChangeListener listener, Func<CancellationToken, Task> startListener, CancellationToken ct = default)
{
    try
    {
        await listener.ListenerTask.ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        if (ex.GetType() != typeof(TaskCanceledException))
        {
            //handle exception
            await startListener(ct).ConfigureAwait(false);
        }
        else
            throw; // rethrow, preserving the original stack trace
    }
}

@Thaina

Thaina commented Jan 2, 2019

Just to make it easier

private async void ListenerRestarter(FirestoreChangeListener listener, Func<CancellationToken, Task> startListener, CancellationToken ct = default)
{
    try
    {
        await listener.ListenerTask.ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        if (ex is TaskCanceledException)
            throw; // rethrow, preserving the original stack trace

        //handle exception
        await startListener(ct).ConfigureAwait(false);
    }
}
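
For completeness, a hypothetical way to wire up the helper above (StartListeningAsync and db are assumed names, not from these posts): the restart delegate re-registers the restarter so that the replacement listener is monitored as well.

Task StartListeningAsync(CancellationToken ct)
{
    FirestoreChangeListener listener = db.Collection("users").Listen(snapshot =>
    {
        // Handle changes here.
    });

    // Watch the new listener too, so a second failure is also picked up.
    ListenerRestarter(listener, StartListeningAsync, ct);
    return Task.CompletedTask;
}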

@jskeet
Collaborator

jskeet commented Jan 2, 2019

The listener ran without any exceptions for just over two weeks, so I suspect that we're fine apart from #2776.

@DenSmoke

DenSmoke commented Jan 11, 2019

Something strange is going on today. My restarter caught these errors (times in UTC):

11.01.2019 8:53:46 FB reports listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
11.01.2019 11:35:26 FB users listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
11.01.2019 17:01:51 FB userLocations listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
11.01.2019 18:33:09 FB orders listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")

I don't know whether it's related to this issue or not.

UPDATE:

13.01.2019 15:06:23 FB reports listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
13.01.2019 15:06:23 FB Starting reports listener
13.01.2019 15:06:24 FB orders listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
13.01.2019 15:06:24 FB Starting orders listener
13.01.2019 15:06:24 FB users listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
13.01.2019 15:06:24 FB Starting users listener

@jskeet
Collaborator

jskeet commented Jan 11, 2019

Odd indeed. I'm at a conference right now, but will look into it when I can.

@jskeet
Collaborator

jskeet commented Jan 16, 2019

@DenSmoke: Do you have access to a stack trace at all? It would be interesting to know exactly where that's coming from. "Aborted" isn't one of the status codes we retry on - but I'll ask whether we should.

@DenSmoke

@jskeet The program is running on a Compute Engine instance, so I can't debug it. But I can modify the exception catcher to also log the stack trace.

@jskeet
Collaborator

jskeet commented Jan 16, 2019

@DenSmoke: That would be handy, thanks. I've asked internally about whether we should add Aborted to the list of status codes to retry.

@jskeet
Collaborator

jskeet commented Jan 16, 2019

@DenSmoke: It looks like I should add Aborted to the list of server statuses to retry. I'll do that tomorrow.

@DenSmoke

@jskeet New errors with stack traces. What's interesting is that they were thrown right after program start. Until January 11, there were almost no such errors, but now I see about 3 errors a day.

17.01.2019 6:21:40 FB orders listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
   at Grpc.Core.Internal.ClientResponseStream`2.MoveNext(CancellationToken token)
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__GetNextResponse|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at DpTools.Firebase.ListenerRestarter(String collectionName, FirestoreChangeListener listener, Func`2 startListener, CancellationToken ct) in D:\a\1\s\DpTools\Firebase.cs:line 316

17.01.2019 6:21:40 FB reports listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
   at Grpc.Core.Internal.ClientResponseStream`2.MoveNext(CancellationToken token)
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__GetNextResponse|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at DpTools.Firebase.ListenerRestarter(String collectionName, FirestoreChangeListener listener, Func`2 startListener, CancellationToken ct) in D:\a\1\s\DpTools\Firebase.cs:line 316

@jskeet
Collaborator

jskeet commented Jan 17, 2019

That's interesting about them being on start-up. It's good to know they're in MoveNext. When I've added retry, I'd expect these to go away. I'm hoping to make the changes and create another release today.

jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Jan 17, 2019
- Retry on Aborted
- Use RpcException when the server removes a target (we don't want to retry)

Fixes googleapis#2776
Fixes googleapis#2721
jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Jan 17, 2019
- Retry on Aborted
- Use RpcException when the server removes a target (we don't want to retry)

Fixes googleapis#2776
Fixes googleapis#2721
jskeet added a commit that referenced this issue Jan 17, 2019
- Retry on Aborted
- Use RpcException when the server removes a target (we don't want to retry)

Fixes #2776
Fixes #2721
@jskeet jskeet added api: firestore Issues related to the Firestore API. and removed api: firestore Issues related to the Firestore API. labels Apr 17, 2019