
FireStore listener as a service stop working after a day #2721

Closed
nicksav opened this issue Nov 23, 2018 · 44 comments

Assignees: jskeet
Labels: api: firestore (Issues related to the Firestore API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@nicksav

nicksav commented Nov 23, 2018

Version: Google.Cloud.Firestore-1.0.0-beta13

Guys, I am having some issues with Firestore real-time updates. I have deployed a web app to an Azure Web Site. On startup, this application subscribes to updates in the Firestore users collection.
        GoogleCredential googleCredential = GoogleCredential.FromJson(JsonConvert.SerializeObject(secret));
        var channel = new Grpc.Core.Channel(FirestoreClient.DefaultEndpoint.Host,
            googleCredential.ToChannelCredentials());

        FirestoreClient client = FirestoreClient.Create(channel);
        FirestoreDb db = FirestoreDb.Create(projectId: "myproject-2018", client: client);

        CollectionReference collection = db.Collection("users");

        // Subscribe to changes in the "users" collection.
        FirestoreChangeListener listener = collection.Listen(async querySnapshot =>
        {
            // Handle each query snapshot here.
        });

Everything works perfectly, but after a couple of hours, or sometimes days, it just stops receiving updates, without any errors. Is there any way for me to debug it to understand why?

@jskeet
Collaborator

jskeet commented Nov 23, 2018

When you say "without any errors" - what happens to the task exposed as listener.ListenerTask? Does that:

  • Complete
  • Stay uncompleted
  • Become faulted

? That would be the first piece of interesting information.

In terms of further diagnostics, there's nothing particularly built into the library to help here, unfortunately. You could turn on gRPC tracing, but that would generate a huge amount of logging.

Does everything else in the application continue to work? In particular, are you doing other Firestore operations on the same Channel (or even with the same FirestoreClient)?

If we could work out a way of reproducing this in an isolated way, that would be ideal. Do you think having one console application which had basically the code you've already shown, and another that just updates a single user entry once every (say) 15 seconds, would be a good start? I'd be happy to create that and just leave it running.
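
For reference, a minimal diagnostic sketch along those lines (assuming `listener` is the FirestoreChangeListener from the snippet above; the five-minute interval is arbitrary):

var listenerTask = listener.ListenerTask;
var statusTimer = new System.Threading.Timer(_ =>
{
    // Log whether the listener task has completed, stayed uncompleted, or faulted.
    Console.WriteLine($"{DateTime.UtcNow:O} listener task status: {listenerTask.Status}");
    if (listenerTask.IsFaulted)
    {
        Console.WriteLine(listenerTask.Exception);
    }
}, null, TimeSpan.Zero, TimeSpan.FromMinutes(5));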

@jskeet jskeet self-assigned this Nov 23, 2018
@jskeet jskeet added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. api: firestore Issues related to the Firestore API. labels Nov 23, 2018
@JustinBeckwith JustinBeckwith added the triage me I really want to be triaged. label Nov 23, 2018
@jskeet jskeet added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Nov 27, 2018
@jskeet
Collaborator

jskeet commented Nov 27, 2018

I've now written some test code. Will leave it running and see what happens:

using Google.Cloud.Firestore;
using System;
using System.Linq;
using System.Threading.Tasks;

namespace Issue2721
{
    class Program
    {
        static void Main(string[] args)
        {
            Task updatingTask = UpdateAsync(args[0]);
            Task listeningTask = ListenAsync(args[0]);
            updatingTask.Wait();
            listeningTask.Wait();
        }

        static async Task UpdateAsync(string projectId)
        {
            var db = FirestoreDb.Create(projectId);
            var document = db.Collection("Issue2721").Document("testdoc");
            while (true)
            {
                var now = DateTime.UtcNow;
                await document.SetAsync(new { LastUpdated = now });
                Log($"Updated to {now:HH:mm:ss.ffffff}");
                await Task.Delay(TimeSpan.FromSeconds(15));
            }
        }

        static async Task ListenAsync(string projectId)
        {
            var db = FirestoreDb.Create(projectId);
            CollectionReference collection = db.Collection("Issue2721");

            FirestoreChangeListener listener = collection.Listen(snapshot =>
            {
                var doc = snapshot.Documents.First();
                var value = doc.GetValue<DateTime>("LastUpdated");
                Log($"Observed value {value:HH:mm:ss.ffffff}");
            });
            await listener.ListenerTask;
        }

        static void Log(string message) =>
            Console.WriteLine($"{DateTime.UtcNow:HH:mm:ss.ffffff}: {message}");
    }
}

@nicksav
Author

nicksav commented Dec 4, 2018

Hey jskeet. Was it still working OK after some time? I will deploy your code to an Azure Web Site and also leave it running for a couple of days. Will update you on Monday with the results.

@jskeet
Collaborator

jskeet commented Dec 4, 2018

Eek! I sincerely apologise - I dropped the ball on this one. I started it running on my laptop first, and of course that dropped the connection when the laptop went to sleep.

Will get it running on a desktop, writing to a file so I can see what's going on...

@jskeet
Collaborator

jskeet commented Dec 4, 2018

(It's been running for 4 hours now - no problems so far, but we'll see what happens over the next 24-48 hours...)

@jskeet
Collaborator

jskeet commented Dec 5, 2018

First test died after 18 hours due to the "updating" code failing with a ServiceUnavailable error. Rerunning - I'd really like to see the exact behavior you described.

@nicksav
Author

nicksav commented Dec 5, 2018

Yep, also deployed to Azure. Will provide a link tomorrow. So if it's running on Azure, how do I prevent the ServiceUnavailable error? Just re-attach the listener to continue listening for changes?

@DenSmoke

DenSmoke commented Dec 5, 2018

I had the same issue, but it disappeared by itself somehow. I mean the listeners stopped working, without an exception, after about one day.

@jskeet
Collaborator

jskeet commented Dec 5, 2018

@DenSmoke: I'm not quite sure I understand what you mean by "it disappeared by itself" - the issue disappeared, or the listener disappeared? If you still have listeners that stop reporting changes (but without the task failing), that's what we're trying to reproduce.

@nicksav: The listener should reattach itself on ServiceUnavailable errors (I think - it certainly does for some errors; I'd need to check exactly which). In my case it was the code making the change that failed with that error.

@DenSmoke

DenSmoke commented Dec 5, 2018

@jskeet I couldn't reproduce the issue again, so "the issue disappeared". It was on beta11.

@jskeet
Collaborator

jskeet commented Dec 6, 2018

The new test has been running for 25 hours with no problems so far. That doesn't mean it won't happen eventually though...

@DenSmoke

DenSmoke commented Dec 7, 2018

The problem happened again in a .NET Core 2.2 console app with Google.Cloud.Firestore beta14 on Ubuntu 18.04, running on a Compute Engine f1-micro instance.

@pauloevpr

pauloevpr commented Dec 7, 2018

I have a similar issue. I am running an ASP.NET Core 2.1 Web API on Google Cloud App Engine. The FirestoreListener stops working eventually. I can't say it is precisely after 24 hours; it seems just random.

From my analysis so far, it seems to be an issue with the Firestore backend rather than the Firestore .NET SDK. Firestore randomly fails when reading/writing documents; it is not just about the listeners.

For example, yesterday my ASP.NET Core Web API got the following exception when getting a document snapshot:

Grpc.Core.RpcException: Status(StatusCode=PermissionDenied, Detail="Missing or insufficient permissions.")

I have seen this at least once every month.

That is not the only one. I also have this one randomly once in a while when reading documents:

Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="The datastore operation timed out, or the data was temporarily unavailable.")

My conclusion is that, as a beta product, it is possible that the Firestore backend randomly fails to serve requests for whatever reason.

For the specific listener issue, my workaround is to have a function, triggered by a timer, that restores the listener as required.
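
For reference, a rough sketch of such a timer-based watchdog (all names here are illustrative, not from the post above; the createListener delegate stands for whatever code sets up the collection.Listen(...) subscription):

using System;
using System.Threading;
using Google.Cloud.Firestore;

public sealed class ListenerWatchdog : IDisposable
{
    private readonly Func<FirestoreChangeListener> _createListener;
    private readonly Timer _timer;
    private FirestoreChangeListener _listener;

    public ListenerWatchdog(Func<FirestoreChangeListener> createListener)
    {
        _createListener = createListener;
        _listener = createListener();
        // Check once a minute; the interval is arbitrary for this sketch.
        _timer = new Timer(_ => RestoreIfNeeded(), null, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));
    }

    private void RestoreIfNeeded()
    {
        // ListenerTask ends (completed, faulted or cancelled) once the listener has stopped,
        // so IsCompleted doubles as a "listener is dead" check.
        if (_listener.ListenerTask.IsCompleted)
        {
            _listener = _createListener();
        }
    }

    public void Dispose() => _timer.Dispose();
}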

@jskeet
Collaborator

jskeet commented Dec 7, 2018

Thanks for the updates everyone - and particularly for that extra detail, @pauloevpr.
My own test is still going strong after about 52 hours, so it isn't giving us any more information :(

I'll contact the Firestore internal team and point them at this issue. They may have some ideas about what's going on.

@jskeet
Collaborator

jskeet commented Dec 10, 2018

Internal feedback suggests this may be a gRPC issue rather than a server side issue. Obviously it being hard to reproduce doesn't help us :( (The test I started 5 days ago is still going strong...)

@pauloevpr: I've filed #2760 for the Unavailable issue. The PermissionDenied issue is odder. Have you seen it more than just the once? If you see it again, could you send me an email ([email protected]) with your project ID and a timestamp, and we'll look in the logs?

@nicksav
Author

nicksav commented Dec 14, 2018

OK, I left it running on Azure and again, it stopped working after some time.
[screenshot: log output]

@jskeet
Collaborator

jskeet commented Dec 14, 2018

Thanks - will check mine!

@jskeet
Collaborator

jskeet commented Dec 14, 2018

@nicksav: Thanks very much for the log - that's useful to see. Interestingly, my test also failed, but in a different way: the process was aborted.

I realized that the Main method wasn't going to tell me the listener task status, either. I've changed it to:

static void Main(string[] args)
{
    try
    {
        Task updatingTask = UpdateAsync(args[0]);
        Task listeningTask = ListenAsync(args[0]);
        listeningTask.Wait();
        Console.WriteLine("Listening task finished");
        updatingTask.Wait();
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }
}

I've restarted the test. I suspect eventually we'll need to capture gRPC detailed logs, but as lots of folks are now going on holiday, I suspect that'll be in 2019. Apologies that it's taking so long to work out what's going on here, but we'll get there eventually...

@DenSmoke

This issue with the listeners has started to happen suspiciously often; I had to restart our server 3 times today. Is there a good workaround to detect listener failure? I can restart the listeners at some interval, but this will consume operations quota.

@jskeet
Collaborator

jskeet commented Dec 14, 2018

My test failed today, but did so in a way which could be detected: the task returned by FirestoreChangeListener.ListenerTask was faulted:

System.AggregateException: One or more errors occurred. (Status(StatusCode=Unavailable, Detail="Transport closed")) ---> Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="Transport closed")
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__CloseStreamAsync|1>d.MoveNext() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 187
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__GetNextResponse|0>d.MoveNext() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 168
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.StartAsync() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 96
   at Google.Cloud.Firestore.WatchStream.StartAsync() in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Google.Cloud.Firestore/WatchStream.cs:line 133
   at Issue2721.Program.ListenAsync(String projectId) in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Issue2721/Program.cs:line 50
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Issue2721.Program.Main(String[] args) in /home/jon/projects/google-cloud-dotnet/apis/Google.Cloud.Firestore/Issue2721/Program.cs:line 16

This is great news! It could be an explanation of what's going wrong:

  • The main loop of the listener is throwing an exception due to the stream dropping, which would normally be handled and restarted
  • While trying to close the stream, that throws an exception, which isn't caught

I can easily just swallow any exception that occurs while closing the stream, although I'd eventually want to have something a little more elegant than blindly swallowing it.

At this stage I'd rather not publish (on nuget.org) another full release before Christmas, but I'd be happy to create a "test package" for you to try, if you'd be willing to do so.

Even if you would rather not try that (and I'll test the fix either way myself) it suggests that if you just use the ListenerTask property, you should be able to detect the failure: if the task faults, the listener has died.
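
A detection-only sketch of that suggestion (restart logic omitted; `collection` is assumed to be a CollectionReference, and the usings are as in the test program above):

FirestoreChangeListener listener = collection.Listen(snapshot =>
{
    // Handle changes here.
});

// If the listener task faults, the listener has died - log it so the failure is visible.
_ = listener.ListenerTask.ContinueWith(
    t => Console.WriteLine($"Listener stopped: {t.Exception}"),
    TaskContinuationOptions.OnlyOnFaulted);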

@DenSmoke

Thanks for the explanation. I never awaited the ListenerTask result, so I didn't see that an exception was thrown. I can easily observe it now and restart the listener on failure.

@jskeet
Collaborator

jskeet commented Dec 14, 2018

@DenSmoke: Should I take it from that comment that you'd prefer not to test a speculative fix? (I completely understand if so - just making sure I've understood you correctly.)

I'm going to try that speculative fix now... (with a tiny bit of extra diagnostic code to observe when it's happened).

@DenSmoke

@jskeet Yes, you understood correctly. I don't have enough time to test the fix right now.

@jskeet
Collaborator

jskeet commented Dec 16, 2018

@nicksav: Thanks - I had another error over the weekend, so need to work a bit more on the fix, but I'll hopefully have an interim build to test some time tomorrow.

@jskeet
Collaborator

jskeet commented Dec 17, 2018

Update: I've had to change my test code to handle failures in the "update" side as well; I've also started logging the errors that cause the stream to be closed in the first place. It's possible that my fix did actually work, but that a lack of further updates was stopping it from showing anything more.
I've started it running again, and will see what that does. If it looks promising, I'll build a test package.

@jskeet
Collaborator

jskeet commented Dec 18, 2018

I saw a failure yesterday and this morning which was unrelated (and which I've asked about internally). I could merge my expected fix in and do another release today; I don't believe it's harmful, and it would let others test over Christmas.

jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Dec 18, 2018
This contains a slightly-speculative fix for issue googleapis#2721, where listeners may fail after a while.
jskeet added a commit that referenced this issue Dec 18, 2018
This contains a slightly-speculative fix for issue #2721, where listeners may fail after a while.
@jskeet
Collaborator

jskeet commented Dec 18, 2018

I've now released 1.0.0-beta15. I won't close this issue just yet, but your feedback would be much appreciated. (The only change compared with 1.0.0-beta14 is swallowing exceptions when closing the listener stream.)

Note that if you get an exception about "Server removed watch target" that's a separate issue - I've seen it twice so far today, but never before.

@jskeet
Collaborator

jskeet commented Dec 18, 2018

Note that until #2776 is fixed (and even after that, really), it's still worth watching the ListenerTask to detect failures.
I'll be looking at 2776 in the New Year.

@nicksav
Author

nicksav commented Dec 28, 2018

Hi guys,
Can anybody provide a good code example of how to correctly handle exceptions and restart the listener?

@DenSmoke

@nicksav The simplest way:

private async void ListenerRestarter(FirestoreChangeListener listener, Func<CancellationToken, Task> startListener, CancellationToken ct = default)
{
    try
    {
        await listener.ListenerTask.ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        if (ex.GetType() != typeof(TaskCanceledException))
        {
            //handle exception
            await startListener(ct).ConfigureAwait(false);
        }
        else
            throw; // rethrow, preserving the original stack trace
    }
}

@Thaina

Thaina commented Jan 2, 2019

Just to make it easier

private async void ListenerRestarter(FirestoreChangeListener listener, Func<CancellationToken, Task> startListener, CancellationToken ct = default)
{
    try
    {
        await listener.ListenerTask.ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        if (ex is TaskCanceledException)
            throw; // rethrow, preserving the original stack trace

        //handle exception
        await startListener(ct).ConfigureAwait(false);
    }
}
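
For completeness, a hypothetical way to wire up the helper above (StartListeningAsync and db are assumed names, not from these posts): the restart delegate re-registers the restarter so that the replacement listener is monitored as well.

Task StartListeningAsync(CancellationToken ct)
{
    FirestoreChangeListener listener = db.Collection("users").Listen(snapshot =>
    {
        // Handle changes here.
    });

    // Watch the new listener too, so a second failure is also picked up.
    ListenerRestarter(listener, StartListeningAsync, ct);
    return Task.CompletedTask;
}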

@jskeet
Collaborator

jskeet commented Jan 2, 2019

The listener ran without any exceptions for just over two weeks, so I suspect that we're fine apart from #2776.

@DenSmoke

DenSmoke commented Jan 11, 2019

Something strange is going on today. My restarter caught these errors (times in UTC):

11.01.2019 8:53:46 FB reports listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
11.01.2019 11:35:26 FB users listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
11.01.2019 17:01:51 FB userLocations listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
11.01.2019 18:33:09 FB orders listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")

I don't know whether it's related to this issue or not.

UPDATE:

13.01.2019 15:06:23 FB reports listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
13.01.2019 15:06:23 FB Starting reports listener
13.01.2019 15:06:24 FB orders listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
13.01.2019 15:06:24 FB Starting orders listener
13.01.2019 15:06:24 FB users listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
13.01.2019 15:06:24 FB Starting users listener

@jskeet
Collaborator

jskeet commented Jan 11, 2019

Odd indeed. I'm at a conference right now, but will look into it when I can.

@jskeet
Collaborator

jskeet commented Jan 16, 2019

@DenSmoke: Do you have access to a stack trace at all? It would be interesting to know exactly where that's coming from. "Aborted" isn't one of the status codes we retry on - but I'll ask whether we should.

@DenSmoke

@jskeet The program is running on a Compute Engine instance, so I can't debug it. But I can modify the exception catcher to also log the stack trace.

@jskeet
Collaborator

jskeet commented Jan 16, 2019

@DenSmoke: That would be handy, thanks. I've asked internally about whether we should add Aborted to the list of status codes to retry.

@jskeet
Collaborator

jskeet commented Jan 16, 2019

@DenSmoke: It looks like I should add Aborted to the list of server statuses to retry. I'll do that tomorrow.

@DenSmoke

@jskeet New errors with stack traces. What's interesting is that they were thrown right after program start. Until January 11, there were almost no such errors, but now I see about 3 errors a day.

17.01.2019 6:21:40 FB orders listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
   at Grpc.Core.Internal.ClientResponseStream`2.MoveNext(CancellationToken token)
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__GetNextResponse|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at DpTools.Firebase.ListenerRestarter(String collectionName, FirestoreChangeListener listener, Func`2 startListener, CancellationToken ct) in D:\a\1\s\DpTools\Firebase.cs:line 316

17.01.2019 6:21:40 FB reports listener error: RpcException Status(StatusCode=Aborted, Detail="The operation was aborted.")
   at Grpc.Core.Internal.ClientResponseStream`2.MoveNext(CancellationToken token)
   at Google.Cloud.Firestore.WatchStream.<>c__DisplayClass16_0.<<StartAsync>g__GetNextResponse|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at Google.Cloud.Firestore.WatchStream.StartAsync()
   at DpTools.Firebase.ListenerRestarter(String collectionName, FirestoreChangeListener listener, Func`2 startListener, CancellationToken ct) in D:\a\1\s\DpTools\Firebase.cs:line 316

@jskeet
Collaborator

jskeet commented Jan 17, 2019

That's interesting about them being on start-up. It's good to know they're in MoveNext. When I've added retry, I'd expect these to go away. I'm hoping to make the changes and create another release today.

jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Jan 17, 2019
- Retry on Aborted
- Use RpcException when the server removes a target (we don't want to retry)

Fixes googleapis#2776
Fixes googleapis#2721
jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Jan 17, 2019
- Retry on Aborted
- Use RpcException when the server removes a target (we don't want to retry)

Fixes googleapis#2776
Fixes googleapis#2721
jskeet added a commit that referenced this issue Jan 17, 2019
- Retry on Aborted
- Use RpcException when the server removes a target (we don't want to retry)

Fixes #2776
Fixes #2721
@jskeet jskeet added api: firestore Issues related to the Firestore API. and removed api: firestore Issues related to the Firestore API. labels Apr 17, 2019