Shield Reolink webhook callback from cancelation #89798

Merged
merged 5 commits into from
Mar 20, 2023

Conversation

starkillerOG
Contributor

Breaking change

Proposed change

Some cameras, like the RLC-520A, do not (yet) support rich ONVIF notifications that include both motion and AI detection states.
These cameras only send motion pushes.
Therefore, when a motion push is received, the AI detection state is polled inside the async webhook callback, which takes between 0.5 and 1.5 seconds on average. This works because a vehicle/pet/person will usually also cause a motion event alongside the AI detection event.
This is not ideal, but it is the best we can do on our side (Reolink should update the firmware to support rich ONVIF notifications).

However since the poll takes 0.5 to 1.5 seconds, the asyncio webhook callback task would get cancelled while waiting on the response, in the upstream library on line:
https://github.com/starkillerOG/reolink_aio/blob/2362257923badb708543643a2f1e3c7bcb07cdb5/reolink_aio/api.py#L3149
response = await self._aiohttp_session.post(url=self._url, json=body, params=param, allow_redirects=False)

This PR shields the callback task, such that it does not get cancelled and the response and further processing of the callback can finish.
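As a rough illustration of the shielding idea (not the actual integration code), the sketch below uses only asyncio. The names poll_ai_state, handle_webhook and demo are hypothetical stand-ins, and the sleep replaces the real HTTP poll. It shows that cancelling the handler task does not cancel the shielded poll:

```python
import asyncio


async def poll_ai_state(results: list) -> None:
    # Hypothetical stand-in for the slow poll of the AI detection state.
    await asyncio.sleep(0.05)
    results.append("ai_state_polled")


async def handle_webhook(results: list) -> None:
    # shield() runs the poll in its own task, so cancelling this handler
    # raises CancelledError here but leaves the poll running.
    await asyncio.shield(poll_ai_state(results))


async def demo() -> list:
    results: list = []
    handler = asyncio.create_task(handle_webhook(results))
    await asyncio.sleep(0.01)  # let the handler start the poll
    handler.cancel()  # simulates the web server cancelling the handler
    try:
        await handler
    except asyncio.CancelledError:
        results.append("handler_cancelled")
    await asyncio.sleep(0.1)  # give the shielded poll time to finish
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the handler itself still sees the CancelledError; only the shielded coroutine runs to completion.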

Note: whether a camera supports rich ONVIF notifications is detected automatically; if it does, no polling is used inside the webhook callback.

(It took me forever to figure out what was going on, this was a hard bug to solve).

Could this be added to the next milestone?

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

@starkillerOG starkillerOG marked this pull request as draft March 17, 2023 11:58
@starkillerOG
Contributor Author

starkillerOG commented Mar 17, 2023

Seeing something weird in testing (see #89798 (comment))
Reverted back to using asyncio.shield, now it is working fine again

@starkillerOG starkillerOG marked this pull request as ready for review March 17, 2023 12:07
@MartinHjelmare
Member

Why is the webhook callback cancelled in the first place?

However since the poll takes 0.5 to 1.5 seconds, the asyncio webhook callback task would get cancelled while waiting on the response, in the upstream library on line:

@starkillerOG
Contributor Author

starkillerOG commented Mar 17, 2023

@MartinHjelmare The cancelation originates from somewhere upstream of handle_webhook, so it is either in the base homeassistant code or in the webhook library used by homeassistant.

I have tried finding it by using:

        try:
            await asyncio.shield(
                self.handle_webhook_shielded(hass, webhook_id, request)
            )
        except asyncio.CancelledError as err:
            _LOGGER.exception(err)
            raise

But that just gives me:

2023-03-17 14:13:26.082 ERROR (MainThread) [homeassistant.components.reolink.host]
Traceback (most recent call last):
  File "/home/hass/home-assistant/homeassistant/components/reolink/host.py", line 333, in handle_webhook
    await asyncio.shield(
asyncio.exceptions.CancelledError

This confirms that it is cancelled by something above handle_webhook, but it is not very helpful in figuring out where.

So bottom line: I have no clue.

My best bet at this moment is that it is similar to what is described here:
aio-libs/aiohttp#2492
The aiohttp web server cancels the task when the client disconnects, so I am guessing this is what happens:

  1. The Reolink camera connects to the HomeAssistant webhook and sends its ONVIF message.
  2. HomeAssistant starts and awaits the callback here:
    response = await webhook["handler"](hass, webhook_id, request)
  3. The Reolink camera does not wait for the HTTP OK, but just disconnects from HomeAssistant.
  4. The aiohttp web server cancels the still-running callback.

Note that this is basically speculation, because I have very little knowledge of how homeassistant and the aiohttp web server actually handle webhooks.

A more recent explanation seems to be here: aio-libs/aiohttp#6719
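The guessed sequence in the numbered steps above can be imitated with plain asyncio. This is only a sketch under that guess: handler and simulate_disconnect are hypothetical names, no aiohttp is involved, and the explicit task.cancel() stands in for whatever the server does when the client disconnects:

```python
import asyncio


async def handler(log: list) -> str:
    log.append("callback started")
    try:
        await asyncio.sleep(0.05)  # stand-in for the 0.5 to 1.5 s poll
    except asyncio.CancelledError:
        log.append("callback cancelled mid-poll")
        raise
    log.append("callback finished")
    return "OK"


async def simulate_disconnect() -> list:
    log: list = []
    task = asyncio.create_task(handler(log))  # step 2: callback is started
    await asyncio.sleep(0.01)  # step 3: client leaves before the reply
    task.cancel()  # step 4: the still-running callback is cancelled
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log


if __name__ == "__main__":
    print(asyncio.run(simulate_disconnect()))
```

The log never contains "callback finished", which matches the symptom of the poll response never being processed.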

@starkillerOG
Contributor Author

@MartinHjelmare I found the answer in the documentation of aiohttp:
https://github.com/aio-libs/aiohttp/blob/master/docs/web_advanced.rst#web-handler-cancellation
Indeed aiohttp web server cancels the task when the client (reolink camera) disconnects from the webhook.
The aiohttp documentation suggests two possible ways of dealing with this:

  • Applying asyncio.shield to a coroutine that saves data.
  • Using aiojobs or another third party library.

So I did implement the correct solution.

@starkillerOG
Contributor Author

starkillerOG commented Mar 17, 2023

HomeAssistant PR #88046 implemented the cancellation policy in a wrapper until aiohttp 3.9 is released.

@MartinHjelmare
So these lines inside HomeAssistant are causing this issue:

    def connection_lost(self, exc: BaseException | None) -> None:
        """Handle connection lost."""
        task_handler = self._task_handler
        super().connection_lost(exc)
        if task_handler is not None:
            task_handler.cancel()

@balloob
Member

balloob commented Mar 17, 2023

If a camera drops a connection, the handler is cancelled. That has been the behavior of aiohttp until 3.8.3 changed it, and we put it back until we can make a decision how we want to move forward with this.

In this case, the task is lingering because you start it but don't await it – but that's the point 😅 You should run await hass.async_block_till_done() to await the task created using hass.async_create_task(...)
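For comparison, the "start a task and await it later" pattern mentioned here can be sketched with plain asyncio. The helpers create_background_task and block_till_done below are hypothetical stand-ins written for this example; they are not the Home Assistant implementations of hass.async_create_task or hass.async_block_till_done:

```python
import asyncio

# Strong references keep background tasks from being garbage collected.
background_tasks: set = set()


def create_background_task(coro) -> asyncio.Task:
    # Hypothetical helper: start the coroutine and track the task.
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task


async def block_till_done() -> None:
    # Hypothetical helper: wait until every tracked task has finished.
    while background_tasks:
        await asyncio.gather(*list(background_tasks))


async def process(results: list) -> None:
    await asyncio.sleep(0.02)  # stand-in for slow processing
    results.append("processed")


async def handle_webhook(results: list) -> None:
    # The handler returns immediately; processing continues in the background.
    create_background_task(process(results))
    results.append("handler returned")


async def demo() -> list:
    results: list = []
    await handle_webhook(results)
    await block_till_done()  # e.g. in a test, before asserting on results
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the work runs in its own task, cancelling the handler would not affect it, but anything that needs the result has to wait for the task explicitly.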

@starkillerOG
Contributor Author

@balloob I get it, thanks for the info.
But what is wrong with asyncio.shield?
That is also proposed by the aiohttp docs.

@balloob
Member

balloob commented Mar 20, 2023

yeah I guess it's fine.

@balloob balloob merged commit 939fce4 into home-assistant:dev Mar 20, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Mar 21, 2023