
Purge causes recorder to stop writing to the DB until HA is restarted (Auto purge happens at 4:12am) #117263

Closed
HertogArjan opened this issue May 11, 2024 · 153 comments · Fixed by #120779


@HertogArjan

HertogArjan commented May 11, 2024

This problem is solved in 2024.7.2. If the system ran out of disk space and the table rebuild failed, it will be retried in 2024.8.1+; see issue #123348 and solution #123388

Workaround: Disabling nightly auto purge will prevent the issue from occurring (this is not a long term solution)

# Example configuration.yaml entry
recorder:
  auto_purge: false

Be sure to re-enable auto-purge after installing 2024.7.2 or your database will grow without bounds, and your system will eventually run out of disk space or become sluggish.

Cause: #117263 (comment)
Solution: #120779

The problem

Every night at around ~4:10 the histories for all entities stop. This has been happening since at least April 9th. I updated Home Assistant to 2024.4.1 on April 5th, but I can't say for sure if this issue started directly afterwards. A restart of Home Assistant allows recording again but does not restore the history missed since ~4:10. I suspect it has something to do with the Recorder auto purge at 4:12 because the same symptoms happen when the purge is run manually.

I don't think the manual or automatic purge is currently able to finish because the (SQLite) database seems way too large (>6GB) for my configured purge_keep_days of 7.

If I run recorder.purge from the web UI the same symptoms occur as during the night. Judging by the mtime, it is clear home-assistant_v2.db no longer gets written to. htop shows HA using 100% of one CPU core continuously, and iotop shows HA reading from disk at ~400 MB/s continuously. This went on for at least 25 minutes before I stopped the process.
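The mtime check can be scripted; a minimal sketch (the filename assumes the default SQLite database in the config directory):

```shell
# Print the database's last-modified time and size every 10 seconds;
# if the recorder has stalled, neither value changes between samples.
while sleep 10; do
  stat -c '%y %s' home-assistant_v2.db
done
```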

The logs show nothing unusual happening around 4:12. When I run recorder.purge from the web UI with verbose logging enabled the logs just show:

2024-05-11 15:16:28.560 INFO (MainThread) [homeassistant.helpers.script.websocket_api_script] websocket_api script: Running websocket_api script
2024-05-11 15:16:28.560 INFO (MainThread) [homeassistant.helpers.script.websocket_api_script] websocket_api script: Executing step call service

When HA is stopped using SIGTERM the shutdown takes a long time and it is clear from the logs it is waiting for a Recorder task:

2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>

See the rest of the relevant messages during shutdown below.

What version of Home Assistant Core has the issue?

core-2024.5.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Core

Integration causing the issue

Recorder

Link to integration documentation on our website

https://www.home-assistant.io/integrations/recorder/#service-purge

Diagnostics information

No response

Example YAML snippet

recorder:
  # keep 7 days of history for all states by default
  purge_keep_days: 7
  exclude:
    domains:
      - weather
    entities:
      - sun.sun
    entity_globs:
      - 'automation.abrp_live_data_*'
      - 'timer.abrp_live_data_*'
      - 'automation.pvoutput_*'
      - 'timer.pvoutput_*'
      - 'sensor.sampled_stroomsterkte_fase_l?'
      - 'sensor.stroomsterkte_fase_l?_*_sec_gem'

Anything in the logs that might be useful for us?

2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Timed out waiting for final writes to complete, the shutdown will continue
2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Timed out waiting for close event to be processed, the shutdown will continue
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Shutdown stage 'close': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Shutdown stage 'close': still running: <Task pending name='Task-2714' coro=<Recorder._async_close() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:467> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.752 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:30.919 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:31.403 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:31.887 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:40.751 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:40.918 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:41.402 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:41.886 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

Additional information

I thought maybe my database could be corrupted, so with HA shut down I ran mv home-assistant_v2.db home-assistant_v2_old.db; sqlite3 home-assistant_v2_old.db ".recover" | sqlite3 home-assistant_v2.db and then tried to run a purge again. Unfortunately the problem was not resolved, though my database did shrink by about 1.5 GB.

@home-assistant

Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of recorder can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign recorder Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


recorder documentation
recorder source
(message by IssueLinks)

@bdraco
Member

bdraco commented May 11, 2024

It sounds like you have an integration that is filling the database with state changes so quickly that the system cannot keep up.

Try enabling debug logging for homeassistant.core and check which states are being changed very frequently
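For a Core install this can be done with the logger integration in configuration.yaml (a minimal sketch; the default level shown here is an assumption, adjust to taste):

```yaml
# Example configuration.yaml entry
logger:
  default: warning
  logs:
    homeassistant.core: debug
```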

@HertogArjan
Author

Hi, I enabled debug logging for homeassistant.core and while there are certainly a lot of state changes being reported, they do not increase notably after starting a purge, when the CPU and I/O start spiking.

I did just now install Home Assistant Container on a separate machine using the same configuration and noticed I could not reproduce the issue by running a manual purge. This could mean the issue is somehow caused by the installation environment.

After doing a manual purge on the Container version the database size has been reduced to 620 MB, nice.

When doing a purge on my original system with debug logging enabled for homeassistant.components.recorder all messages from the recorder stop abruptly after:

2024-05-11 18:29:27.435 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 1059 attributes_ids to remove
2024-05-11 18:29:27.485 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7f41add689f0> states to remove old_state_id
2024-05-11 18:29:27.521 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b4052270> states
2024-05-11 18:29:27.538 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 1021 attributes_ids to remove
2024-05-11 18:29:27.601 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b4052270> states to remove old_state_id
2024-05-11 18:29:27.628 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41add689f0> states
2024-05-11 18:29:27.640 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 1027 attributes_ids to remove
2024-05-11 18:29:27.689 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b4775550> states to remove old_state_id
2024-05-11 18:29:27.716 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b4052270> states
2024-05-11 18:29:28.065 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 25155 shared attributes to remove
2024-05-11 18:29:28.208 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b1745a20> attribute states
2024-05-11 18:29:28.301 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b4f81fd0> attribute states
2024-05-11 18:29:28.359 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b2c79da0> attribute states
2024-05-11 18:29:28.414 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b1747e70> attribute states
2024-05-11 18:29:28.467 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b2c79da0> attribute states
2024-05-11 18:29:28.520 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b4f81fd0> attribute states
2024-05-11 18:29:28.537 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7f41b2c79da0> attribute states
2024-05-11 18:29:28.538 DEBUG (Recorder) [homeassistant.components.recorder.purge] After purging states and attributes_ids remaining=True
2024-05-11 18:29:28.736 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 event ids and 817 data_ids to remove

@bdraco
Member

bdraco commented May 11, 2024

That sounds like a corrupt index

You might try running an integrity check https://www.sqlite.org/pragma.html#pragma_integrity_check on the database and reindex anything that comes up: https://www.sqlite.org/lang_reindex.html
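With HA stopped, both checks can be run from the sqlite3 CLI (a sketch; the database path assumes the default Core install):

```shell
# Run SQLite's built-in integrity check; a healthy database prints just "ok".
sqlite3 home-assistant_v2.db 'PRAGMA integrity_check;'
# Rebuild all indexes (or name a specific index reported by the check).
sqlite3 home-assistant_v2.db 'REINDEX;'
```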

It could also indicate a problem with your storage medium

@bdraco added the "database: sqlite3" and "problem in database" (the problem is with the database setup and cannot be fixed in Home Assistant) labels on May 11, 2024
@HertogArjan
Author

It is pretty much certainly an issue with my installation of HA Core and not with the database or storage medium. I installed HA Container on the same machine and now the issue is gone.

Any suggestions why the issue only occurs with HA Core?

@bdraco
Member

bdraco commented May 11, 2024

Do you have a different version of SQLite on the core install?

@HertogArjan
Author

HertogArjan commented May 12, 2024

No, because SQLite (the CLI tool) is not part of the Core dependencies. The Recorder page mentions Home Assistant uses SQLAlchemy for database access, but that module is not installed manually when installing Core. I expect HA installs it itself and keeps it up to date. I don't know how I would compare the version numbers for this module.

@bdraco
Member

bdraco commented May 12, 2024

You can find the version under Settings > System > Repairs > three-dot menu > System Information

@HertogArjan
Author

Thanks @bdraco. According to the system information my Core install does indeed use a newer SQLite version than the Container install (3.45.3 > 3.44.2). It also uses a slightly newer Python version (3.12.3 > 3.12.2) for a slightly lower HA Core version (core-2024.5.2 < core-2024.5.3).

I can try to match the Core install to the Container install and see what happens. Do you know how I would downgrade the SQLite version?

If you're interested I attach the system information outputs for the Core and Container installs:
sys_info_core.txt
sys_info_container.txt

@skandalfo
Contributor

This is now also happening for me, on a Core install.

I recently upgraded Ubuntu Server to 24.04. Also on Python 3.12.3. The SQLite3 library version installed is 3.45.1-1ubuntu2.

It locks up every morning at the scheduled time (4:12?) and also when I manually call the recorder.purge service. One CPU core gets pegged at 100% by hass, and stopping it produces the relevant log lines:

2024-06-16 01:09:57.503 WARNING (MainThread) [homeassistant.core] Timed out waiting for final writes to complete, the shutdown will continue
2024-06-16 01:09:57.503 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-126987' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:476> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()]> cb=[set.remove()]>

I installed the command-line sqlite3 tool and ran VACUUM and PRAGMA integrity_check; on the database file with no issues (both reported "ok").

@skandalfo
Contributor

I'm assuming this line in homeassistant/components/recorder/purge.py:

deleted_rows = session.execute(delete_event_rows(event_ids))

is the one causing the hang, as we never see the following log entry:

_LOGGER.debug("Deleted %s events", deleted_rows)

I also got logs that look the same once debugging / logging level debug are enabled:

...
2024-06-16 01:23:16.430 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 326 attributes_ids to remove
2024-06-16 01:23:16.605 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4dd2740> states to remove old_state_id
2024-06-16 01:23:16.723 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4dd2e40> states
2024-06-16 01:23:16.783 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 380 attributes_ids to remove
2024-06-16 01:23:16.958 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4dd8670> states to remove old_state_id
2024-06-16 01:23:17.070 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4dd2740> states
2024-06-16 01:23:17.127 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 346 attributes_ids to remove
2024-06-16 01:23:17.297 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4dd8670> states to remove old_state_id
2024-06-16 01:23:17.412 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4dd29e0> states
2024-06-16 01:23:17.444 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 1915 shared attributes to remove
2024-06-16 01:23:17.482 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7dc9b4ddacf0> attribute states
2024-06-16 01:23:17.484 DEBUG (Recorder) [homeassistant.components.recorder.purge] After purging states and attributes_ids remaining=True
2024-06-16 01:23:17.550 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 event ids and 32 data_ids to remove

@skandalfo
Contributor

skandalfo commented Jun 16, 2024

Partial thread dump via pyrasite (see https://stackoverflow.com/questions/6849138/check-what-a-running-process-is-doing-print-stack-trace-of-an-uninstrumented-py):

16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
jun 16 01:59:04 piernas hass[116758]:     self._bootstrap_inner()
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
jun 16 01:59:04 piernas hass[116758]:     self.run()
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/pyudev/monitor.py", line 533, in run
jun 16 01:59:04 piernas hass[116758]:     for file_descriptor, event in eintr_retry_call(notifier.poll):
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/pyudev/_util.py", line 152, in eintr_retry_call
jun 16 01:59:04 piernas hass[116758]:     return func(*args, **kwargs)
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/pyudev/_os/poll.py", line 91, in poll
jun 16 01:59:04 piernas hass[116758]:     return list(self._parse_events(eintr_retry_call(self._notifier.poll, timeout)))
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/pyudev/_util.py", line 152, in eintr_retry_call
jun 16 01:59:04 piernas hass[116758]:     return func(*args, **kwargs)
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
jun 16 01:59:04 piernas hass[116758]:     self._bootstrap_inner()
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
jun 16 01:59:04 piernas hass[116758]:     self.run()
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py", line 703, in run
jun 16 01:59:04 piernas hass[116758]:     self._run()
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py", line 787, in _run
jun 16 01:59:04 piernas hass[116758]:     self._run_event_loop()
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py", line 872, in _run_event_loop
jun 16 01:59:04 piernas hass[116758]:     self._guarded_process_one_task_or_event_or_recover(queue_.get())
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py", line 906, in _guarded_process_one_task_or_event_or_recover
jun 16 01:59:04 piernas hass[116758]:     self._process_one_task_or_event_or_recover(task)
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py", line 926, in _process_one_task_or_event_or_recover
jun 16 01:59:04 piernas hass[116758]:     task.run(self)
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/tasks.py", line 114, in run
jun 16 01:59:04 piernas hass[116758]:     if purge.purge_old_data(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/util.py", line 643, in wrapper
jun 16 01:59:04 piernas hass[116758]:     return job(instance, *args, **kwargs)
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/purge.py", line 92, in purge_old_data
jun 16 01:59:04 piernas hass[116758]:     has_more_to_purge |= _purge_events_and_data_ids(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/purge.py", line 237, in _purge_events_and_data_ids
jun 16 01:59:04 piernas hass[116758]:     _purge_event_ids(session, event_ids)
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/purge.py", line 577, in _purge_event_ids
jun 16 01:59:04 piernas hass[116758]:     deleted_rows = session.execute(delete_event_rows(event_ids))
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2351, in execute
jun 16 01:59:04 piernas hass[116758]:     return self._execute_internal(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2236, in _execute_internal
jun 16 01:59:04 piernas hass[116758]:     result: Result[Any] = compile_state_cls.orm_execute_statement(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/orm/bulk_persistence.py", line 1953, in orm_execute_statement
jun 16 01:59:04 piernas hass[116758]:     return super().orm_execute_statement(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/orm/context.py", line 293, in orm_execute_statement
jun 16 01:59:04 piernas hass[116758]:     result = conn.execute(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
jun 16 01:59:04 piernas hass[116758]:     return meth(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/sql/lambdas.py", line 603, in _execute_on_connection
jun 16 01:59:04 piernas hass[116758]:     return connection._execute_clauseelement(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
jun 16 01:59:04 piernas hass[116758]:     ret = self._execute_context(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
jun 16 01:59:04 piernas hass[116758]:     return self._exec_single_context(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
jun 16 01:59:04 piernas hass[116758]:     self.dialect.do_execute(
jun 16 01:59:04 piernas hass[116758]:   File "/srv/homeassistant/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
jun 16 01:59:04 piernas hass[116758]:     cursor.execute(statement, parameters)
jun 16 01:59:04 piernas hass[116758]:   File "<string>", line 1, in <module>
jun 16 01:59:04 piernas hass[116758]:   File "<string>", line 5, in <module>
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
jun 16 01:59:04 piernas hass[116758]:     self._bootstrap_inner()
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
jun 16 01:59:04 piernas hass[116758]:     self.run()
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1010, in run
jun 16 01:59:04 piernas hass[116758]:     self._target(*self._args, **self._kwargs)
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 89, in _worker
jun 16 01:59:04 piernas hass[116758]:     work_item = work_queue.get(block=True)
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
jun 16 01:59:04 piernas hass[116758]:     self._bootstrap_inner()
jun 16 01:59:04 piernas hass[116758]:   File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner

Not sure what the two <string> pieces of the stack mean... Dynamically generated Python code being executed?
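In general, a `<string>` filename in a traceback means that frame's code was compiled from a string (via `compile()`/`exec()`) rather than loaded from a source file, so yes, dynamically generated code. A minimal illustration:

```python
# Code compiled from a string carries the pseudo-filename "<string>"
# in its frames instead of a real source path.
code = compile(
    "import traceback\nstack = traceback.extract_stack()",
    "<string>",
    "exec",
)
ns = {}
exec(code, ns)
frame = ns["stack"][-1]  # the frame of the exec'd code itself
print(frame.filename)  # → <string>
```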

@skandalfo
Contributor

I stopped HA and opened the database with the command-line sqlite3 tool.

So, I think I had some entries in the events table with an empty-string data_id (timestamp obtained via date +%s --date="7 days ago"):

sqlite> select count(*), data_id from events where time_fired_ts < 1717965456 group by data_id;
2118|
3|22046
3|22047
3|22048
...

I just deleted the relevant events:

sqlite> delete from events where time_fired_ts < 1717965456;

And verified that no unreferenced event_data entries remained:

sqlite> select distinct(data_id) from event_data where data_id not in (select distinct(data_id) from events);

Calling recorder.purge after that completed successfully.
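The orphan check above can also be wrapped in a small helper (a sketch against a toy schema, not the real recorder tables; note that SQL's `NOT IN` matches nothing at all if the subquery yields a NULL, so rows with a NULL data_id are excluded from the subquery explicitly):

```python
import sqlite3

def orphaned_event_data_ids(conn: sqlite3.Connection) -> list:
    """Return data_id values in event_data that no events row references.

    NOT IN matches nothing at all if the subquery yields a NULL, so rows
    where events.data_id is NULL must be filtered out of the subquery.
    """
    rows = conn.execute(
        "SELECT data_id FROM event_data WHERE data_id NOT IN "
        "(SELECT data_id FROM events WHERE data_id IS NOT NULL) "
        "ORDER BY data_id"
    ).fetchall()
    return [r[0] for r in rows]

# Demo on a toy schema: data_id 3 has no referencing event.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER PRIMARY KEY, data_id INTEGER,
                         time_fired_ts FLOAT);
    CREATE TABLE event_data (data_id INTEGER PRIMARY KEY, shared_data TEXT);
    INSERT INTO event_data (data_id) VALUES (1), (2), (3);
    INSERT INTO events (data_id, time_fired_ts)
        VALUES (1, 0.0), (2, 0.0), (NULL, 0.0);
""")
print(orphaned_event_data_ids(conn))  # → [3]
```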

@skandalfo
Contributor

Unfortunately, my instance got back to getting stuck at purge the second day after that. At this point my next step is probably to add debug logging at the various points of the above stack trace the next time this reproduces (I cleaned the database manually again this time).

@bdraco
Member

bdraco commented Jun 28, 2024

We found the problem

Some old databases have FOREIGN KEY(event_id) REFERENCES events (event_id) ON DELETE CASCADE in the states table. The index used for the foreign key is now empty on some systems, which may result in a full table scan with the version of SQLite included in Alpine 3.20.x when deleting rows from the states table. The problem became more widespread when the Docker images were updated from Alpine 3.19 to 3.20. Other operating systems may have updated SQLite sooner, so this problem also appeared sooner on those systems. Downgrading SQLite or disabling foreign keys would also likely work around the issue, but this is not recommended.

Unfortunately, SQLite does not support dropping a foreign key constraint (see "Why ALTER TABLE is such a problem for SQLite" at https://www.sqlite.org/lang_altertable.html), which makes the solution much more complex.

To fix this, the whole states table needs to be recreated using the steps described under "Making Other Kinds Of Table Schema Changes" at https://www.sqlite.org/lang_altertable.html.
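Condensed, that rebuild procedure looks like the following (an illustrative sketch on a toy states table with a stand-in foreign key, not the recorder's actual migration code):

```python
import sqlite3

# Toy database with a table carrying an unwanted FOREIGN KEY constraint.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER PRIMARY KEY);
    CREATE TABLE states (
        state_id INTEGER PRIMARY KEY,
        state TEXT,
        event_id INTEGER,
        FOREIGN KEY(event_id) REFERENCES events (event_id) ON DELETE CASCADE
    );
    INSERT INTO states (state, event_id) VALUES ('on', NULL), ('off', NULL);
""")

# Steps from https://www.sqlite.org/lang_altertable.html
# ("Making Other Kinds Of Table Schema Changes"), condensed:
conn.execute("PRAGMA foreign_keys=OFF")
with conn:  # wraps the rebuild in a transaction
    # 1. Create a new table with the desired schema (FK dropped).
    conn.execute("""
        CREATE TABLE states_new (
            state_id INTEGER PRIMARY KEY,
            state TEXT,
            event_id INTEGER
        )
    """)
    # 2. Copy the data across.
    conn.execute(
        "INSERT INTO states_new (state_id, state, event_id) "
        "SELECT state_id, state, event_id FROM states"
    )
    # 3. Drop the old table and rename the new one into place.
    conn.execute("DROP TABLE states")
    conn.execute("ALTER TABLE states_new RENAME TO states")
    # 4. Recreate any indexes the old table had here.
conn.execute("PRAGMA foreign_keys=ON")

schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'states'"
).fetchone()[0]
print("FOREIGN KEY" in schema)  # → False
```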

@bdraco
Member

bdraco commented Jun 28, 2024

if instance.dialect_name != SupportedDialect.SQLITE:

@bdraco
Member

bdraco commented Jun 28, 2024

so we need to write a rebuild_table function for SQLite to fix this.

Sadly, writing that code is very risky if we get it wrong. Frankly, it's scary.

Going to go with the 12-step rebuild process. It's slower but less risky.

@bdraco
Member

bdraco commented Jun 29, 2024

Will be fixed in 2024.8.0

@ChristophCaina
Contributor

Cool. Just noticed the same behavior on my system for the last two days, during the tests of 2024.7.0bx.
Not sure what exactly triggered this issue to happen now, while testing the beta; until then, the recorder was working without issues.

@skandalfo
Contributor

skandalfo commented Jun 30, 2024

I'm not sure this is the actual problem in my case.

sqlite> .schema event_data
CREATE TABLE event_data (
        data_id INTEGER NOT NULL,
        hash BIGINT,
        shared_data TEXT,
        PRIMARY KEY (data_id)
);
CREATE INDEX ix_event_data_hash ON event_data (hash);
sqlite> .schema events
CREATE TABLE events (
        event_id INTEGER NOT NULL,
        event_type VARCHAR(32),
        event_data TEXT,
        origin VARCHAR(32),
        time_fired DATETIME,
        created DATETIME,
        context_id VARCHAR(36),
        context_user_id VARCHAR(36),
        context_parent_id VARCHAR(36), data_id INTEGER, origin_idx INTEGER, time_fired_ts FLOAT, context_id_bin BLOB, context_user_id_bin BLOB, context_parent_id_bin BLOB, event_type_id INTEGER,
        PRIMARY KEY (event_id)
);
CREATE INDEX ix_events_data_id ON events (data_id);
CREATE INDEX ix_events_time_fired_ts ON events (time_fired_ts);
CREATE INDEX ix_events_context_id_bin ON events (context_id_bin);
CREATE INDEX ix_events_event_type_id_time_fired_ts ON events (event_type_id, time_fired_ts);
sqlite>

I can't see any foreign key there...

Edit: removed spurious extra copied text.

@bdraco
Member

bdraco commented Jun 30, 2024

> I'm not sure this is the actual problem in my case. [full schema output quoted from the previous comment]

Please check again, as it looks like that is not the right database file, unless you are running a 2-3 year old version of HA.

@skandalfo
Contributor

This instance was installed originally more than two years ago and has been migrated to the latest version each time via pip3.

home-assistant_v2.db is the database file I've been opening with the sqlite3 command-line utility above to inspect and mitigate this issue (and my SQL statements on it "did things"):

homeassistant@piernas ~/.homeassistant (master)> ls -la *.db
-rw-r--r-- 1 homeassistant homeassistant 833134592 jun 30 15:22 home-assistant_v2.db
-rw-r--r-- 1 homeassistant homeassistant   1384448 jun 30 15:18 zigbee.db
homeassistant@piernas ~/.homeassistant (master)>

Isn't this the right database file? What other name should I look for?

@bdraco
Member

bdraco commented Jun 30, 2024

Sure looks like it's been changed recently. It's very strange that it is missing all the foreign keys.

I'm not sure what's going on with your system.

events should look like this

sqlite> .schema events
CREATE TABLE events (
	event_id INTEGER NOT NULL, 
	event_type VARCHAR(64), 
	event_data TEXT, 
	origin VARCHAR(32), 
	time_fired DATETIME, 
	context_id VARCHAR(36), 
	context_user_id VARCHAR(36), 
	context_parent_id VARCHAR(36), 
	data_id INTEGER, origin_idx INTEGER, time_fired_ts FLOAT, context_id_bin BLOB, context_parent_id_bin BLOB, context_user_id_bin BLOB, event_type_id INTEGER, 
	PRIMARY KEY (event_id), 
	FOREIGN KEY(data_id) REFERENCES event_data (data_id)
);
CREATE INDEX ix_events_data_id ON events (data_id);
CREATE INDEX ix_events_time_fired_ts ON events (time_fired_ts);
CREATE INDEX ix_events_context_id_bin ON events (context_id_bin);
CREATE INDEX ix_events_event_type_id_time_fired_ts ON events (event_type_id, time_fired_ts);

@bdraco
Member

bdraco commented Jul 11, 2024

So.. my db file is 16.0 gb and even after removing all the local backup files, I only have 24 gb free. This doesn't seem to have been a problem with previous monthly purges though, so why is it suddenly a problem now?

24 GB is likely enough. It should not be a problem, but it is always better to have 2× the database size free in case a table rebuild fails.

I don't see any table rebuilds in your logs so either the table rebuild already happened, or there is something wrong with your database that prevents it from getting to that point.

You can check to see if the table rebuild already happened by manually running the following

% sqlite3 home-assistant_v2.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> .schema states
CREATE TABLE IF NOT EXISTS "states" (
	"state_id" INTEGER NOT NULL  ,
	"entity_id" CHARACTER(0) NULL  ,
	"state" VARCHAR(255) NULL  ,
	"attributes" CHARACTER(0) NULL  ,
	"event_id" SMALLINT NULL  ,
	"last_changed" CHARACTER(0) NULL  ,
	"last_changed_ts" DOUBLE NULL  ,
	"last_updated" CHARACTER(0) NULL  ,
	"last_updated_ts" DOUBLE NULL  ,
	"old_state_id" INTEGER NULL  ,
	"attributes_id" INTEGER NULL  ,
	"context_id" CHARACTER(0) NULL  ,
	"context_user_id" CHARACTER(0) NULL  ,
	"context_parent_id" CHARACTER(0) NULL  ,
	"origin_idx" SMALLINT NULL  ,
	"context_id_bin" BLOB NULL  ,
	"context_user_id_bin" BLOB NULL  ,
	"context_parent_id_bin" BLOB NULL  ,
	"metadata_id" INTEGER NULL  , last_reported_ts FLOAT,
	PRIMARY KEY ("state_id"),
	FOREIGN KEY("old_state_id") REFERENCES "states" ("state_id") ON UPDATE RESTRICT ON DELETE RESTRICT,
	FOREIGN KEY("attributes_id") REFERENCES "state_attributes" ("attributes_id") ON UPDATE RESTRICT ON DELETE RESTRICT,
	FOREIGN KEY("metadata_id") REFERENCES "states_meta" ("metadata_id") ON UPDATE RESTRICT ON DELETE RESTRICT
);
CREATE INDEX "ix_states_attributes_id" ON "states" ("attributes_id");
CREATE INDEX "ix_states_context_id_bin" ON "states" ("context_id_bin");
CREATE INDEX "ix_states_last_updated_ts" ON "states" ("last_updated_ts");
CREATE INDEX "ix_states_metadata_id_last_updated_ts" ON "states" ("metadata_id", "last_updated_ts");
CREATE INDEX "ix_states_old_state_id" ON "states" ("old_state_id");
sqlite> 

If you only see the 3 FOREIGN KEYs, the problematic one has been removed and the rebuild is complete.

If it's still there, something else is wrong, and you'll need to enable debug logging for sqlalchemy and restart to get an idea of what is going on.
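Instead of eyeballing the .schema output, the same check can be scripted with PRAGMA foreign_key_list (a sketch; run it against a copy of the database while HA is stopped):

```python
import sqlite3

def has_event_id_fk(db_path: str) -> bool:
    """True if the states table still has the problematic
    FOREIGN KEY(event_id) REFERENCES events (event_id) constraint."""
    conn = sqlite3.connect(db_path)
    try:
        # Each row: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute("PRAGMA foreign_key_list(states)").fetchall()
        return any(row[2] == "events" and row[3] == "event_id" for row in fks)
    finally:
        conn.close()
```

Returning True means the problematic constraint is still in place, i.e. the rebuild has not completed.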

@kchiem

kchiem commented Jul 11, 2024

I only see 2 FOREIGN KEYs:

# sqlite3 home-assistant_v2.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .schema states
CREATE TABLE states (
        state_id INTEGER NOT NULL,
        domain VARCHAR(64),
        entity_id VARCHAR(255),
        state VARCHAR(255),
        attributes TEXT,
        event_id INTEGER,
        last_changed DATETIME,
        last_updated DATETIME,
        created DATETIME,
        old_state_id INTEGER, attributes_id INTEGER, origin_idx INTEGER, context_id VARCHAR(36), context_user_id VARCHAR(36), context_parent_id VARCHAR(36), last_updated_ts FLOAT, last_changed_ts FLOAT, context_id_bin BLOB, context_user_id_bin BLOB, context_parent_id_bin BLOB, metadata_id INTEGER, last_reported_ts FLOAT,
        PRIMARY KEY (state_id),
        FOREIGN KEY(event_id) REFERENCES events (event_id) ON DELETE CASCADE,
        FOREIGN KEY(old_state_id) REFERENCES states (state_id)
);
CREATE INDEX ix_states_old_state_id ON states (old_state_id);
CREATE INDEX ix_states_attributes_id ON states (attributes_id);
CREATE INDEX ix_states_last_updated_ts ON states (last_updated_ts);
CREATE INDEX ix_states_context_id_bin ON states (context_id_bin);
CREATE INDEX ix_states_metadata_id_last_updated_ts ON states (metadata_id, last_updated_ts);

@bdraco
Member

bdraco commented Jul 11, 2024

I only see 2 FOREIGN KEYs:

It looks like you don't have this issue. There should be a FK on metadata_id, but it shouldn't cause any issues if it's missing.

@kchiem

kchiem commented Jul 11, 2024

It looks like you don't have this issue. There should be a FK on metadata_id, but it shouldn't cause any issues if it's missing.

Then why does the recorder purge cause my graphs to flatline?

@bdraco
Member

bdraco commented Jul 11, 2024

It looks like you don't have this issue. There should be a FK on metadata_id, but it shouldn't cause any issues if it's missing.

Then why does the recorder purge cause my graphs to flatline?

While the symptoms are the same, since the problematic foreign key is not there, it's a different problem which will require a different solution. Please start a fresh issue report with debug logs for sqlalchemy turned on.

# Example configuration.yaml entry
logger:
  default: info
  logs:
    sqlalchemy: debug

@steerage250

2024.7.2 has certainly solved the "recorder locking-up" problem, but something is still not right in my system.

The home-assistant_v2.db file grew from 3.2GB to 3.8GB after the upgrade, but has been static in size over the last 5 days.
Despite having purge_keep_days: 10, I can select data that is 20 days old.

I have the following in configuration.yaml:
recorder:
  purge_keep_days: 10
  # auto_purge: false

@jasonwfrancis1

Hey, I don't really know what to do, I'm having the same problem as above. The recorder stops every day at 4:30 and I have to reboot. I updated to 2024.7.2 and it's the same. When I look in SQLite, the temp table for events exists, so it seems the migration did not complete. When I save the DB and try to do the migration externally, it fails copying the data from the old table to the new one because of "unique constraints". I really do not want to lose all of my energy data. I'm super annoyed that it was all working fine until a recent update and now I'm spending hours trying to fix it. Any ideas of where I might go for help? I tried the custom component but it didn't do anything (I assume it thinks the migration was completed).

@7w1z4rd

7w1z4rd commented Jul 16, 2024

I also have the problem, just like jasonwfrancis1

@disconn3ct

As an added datapoint, I have been having this issue on mariadb. Upgrading didn't fix it.

@jasonwfrancis1

I managed to fix it myself: I migrated to MariaDB and used a good tutorial to migrate all of the data. A few gaps while I was doing the whole process, but I think I'm back up and running properly. Just kicked off a purge, and it's running... fingers crossed.

@matomatusov

Hi @bdraco

After some time, my database keeps growing. I don't know how to reduce it, or whether your fixes are working correctly.


@Sebazzz

Sebazzz commented Jul 20, 2024

> Hey, I don't really know what to do, I'm having the same problem as above. [full comment quoted from above]

Also having this problem, even with 2024.7.2:

# sqlite3 home-assistant_v2.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .schema event_data
CREATE TABLE event_data (
        data_id INTEGER NOT NULL,
        hash BIGINT,
        shared_data TEXT,
        PRIMARY KEY (data_id)
);
CREATE INDEX ix_event_data_hash ON event_data (hash);
sqlite> .schema events
CREATE TABLE events (
        event_id INTEGER NOT NULL,
        event_type VARCHAR(64),
        event_data TEXT,
        origin VARCHAR(32),
        time_fired DATETIME,
        created DATETIME,
        context_id VARCHAR(36),
        context_user_id VARCHAR(36),
        context_parent_id VARCHAR(36), data_id INTEGER, origin_idx INTEGER, time_fired_ts FLOAT, context_id_bin BLOB, context_user_id_bin BLOB, context_parent_id_bin BLOB, event_type_id INTEGER,
        PRIMARY KEY (event_id)
);
CREATE INDEX ix_events_data_id ON events (data_id);
CREATE INDEX ix_events_time_fired_ts ON events (time_fired_ts);
CREATE INDEX ix_events_context_id_bin ON events (context_id_bin);
CREATE INDEX ix_events_event_type_id_time_fired_ts ON events (event_type_id, time_fired_ts);
sqlite> .tables
event_data              schema_changes          states_temp_1720627841
event_types             state_attributes        statistics
events                  states                  statistics_meta
migration_changes       states_meta             statistics_runs
recorder_runs           states_temp_1720625711  statistics_short_term

The error appears some time after the 4 AM purge, but events are no longer recorded from 4 AM onward.

2024-07-20 08:15:58.622 ERROR (MainThread) [homeassistant.components.recorder.core] The recorder backlog queue reached the maximum size of 255442 events; usually, the system is CPU bound, I/O bound, or the database is corrupt due to a disk problem; The recorder will stop recording events to avoid running out of memory

Until 04:12:xx there are log messages like this:

2024-07-20 04:10:10.268 INFO (Recorder) [sqlalchemy.engine.Engine] [insertmanyvalues 9/11 (ordered; batch not supported)] (None, '90.9', None, None, None, None, None, None, 1721441409.5084128, 1225118537, 12273833, None, None, None, 0, <memory at 0x7f942b3d71c0>, None, None, 185170)
2024-07-20 04:10:10.269 INFO (Recorder) [sqlalchemy.engine.Engine] INSERT INTO states (entity_id, state, attributes, event_id, last_changed, last_changed_ts, last_reported_ts, last_updated, last_updated_ts, old_state_id, attributes_id, context_id, context_user_id, context_parent_id, origin_idx, context_id_bin, context_user_id_bin, context_parent_id_bin, metadata_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) RETURNING state_id
2024-07-20 04:10:10.269 INFO (Recorder) [sqlalchemy.engine.Engine] [insertmanyvalues 10/11 (ordered; batch not supported)] (None, '5.309', None, None, None, None, None, None, 1721441410.0638864, 1225118451, 338851, None, None, None, 0, <memory at 0x7f942b3d5600>, None, None, 4)
2024-07-20 04:10:10.269 INFO (Recorder) [sqlalchemy.engine.Engine] INSERT INTO states (entity_id, state, attributes, event_id, last_changed, last_changed_ts, last_reported_ts, last_updated, last_updated_ts, old_state_id, attributes_id, context_id, context_user_id, context_parent_id, origin_idx, context_id_bin, context_user_id_bin, context_parent_id_bin, metadata_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) RETURNING state_id
2024-07-20 04:10:10.270 INFO (Recorder) [sqlalchemy.engine.Engine] [insertmanyvalues 11/11 (ordered; batch not supported)] (None, '5.309', None, None, None, None, None, None, 1721441410.0642517, 1225118452, 338852, None, None, None, 0, <memory at 0x7f942b3d7640>, None, None, 39)
2024-07-20 04:10:10.271 INFO (Recorder) [sqlalchemy.engine.Engine] COMMIT

Then:

2024-07-20 04:12:00.160 INFO (Recorder) [sqlalchemy.engine.Engine] BEGIN (implicit)
2024-07-20 04:12:00.162 INFO (Recorder) [sqlalchemy.engine.Engine] SELECT states.state_id, states.attributes_id 
FROM states 
WHERE states.last_updated_ts < ?
 LIMIT ? OFFSET ?
2024-07-20 04:12:00.162 INFO (Recorder) [sqlalchemy.engine.Engine] [generated in 0.00029s] (1718849520.132695, 4000, 0)
2024-07-20 04:12:00.285 INFO (Recorder) [sqlalchemy.engine.Engine] UPDATE states SET old_state_id=? WHERE states.old_state_id IN ([many])
2024-07-20 04:12:00.285 INFO (Recorder) [sqlalchemy.engine.Engine] [generated in 0.01305s] (None, 1139875840, 1139875841, 1139875842, 1139875843, 1139875844, 1139875845, 1139875846, 1139875847, 1139875848, 1139875849, 1139875850, 1139875851, 1139875852, 1139875853, 1139875854, 1139875855, 1139875856, 1139875857, 1139875858, 1139875859, 1139875860, 1139875861, 1139875862, 1139875863, 1139875864, 1139875865, 1139875866, 1139875867, 1139875868, 1139875869, 1139875870, 1139875871, 1139875872, 1139875873, 1139875874, 1139875875, 1139875876, 1139875877, 1139875878, 1139875879, 1139875880, 1139875881, 1139875882, 1139875883, 1139875884, 1139875885, 1139875886, 1139875887, 1139875888 ... 3901 parameters truncated ... 1139875790, 1139875791, 1139875792, 1139875793, 1139875794, 1139875795, 1139875796, 1139875797, 1139875798, 1139875799, 1139875800, 1139875801, 1139875802, 1139875803, 1139875804, 1139875805, 1139875806, 1139875807, 1139875808, 1139875809, 1139875810, 1139875811, 1139875812, 1139875813, 1139875814, 1139875815, 1139875816, 1139875817, 1139875818, 1139875819, 1139875820, 1139875821, 1139875822, 1139875823, 1139875824, 1139875825, 1139875826, 1139875827, 1139875828, 1139875829, 1139875830, 1139875831, 1139875832, 1139875833, 1139875834, 1139875835, 1139875836, 1139875837, 1139875838, 1139875839)
2024-07-20 04:12:00.402 INFO (Recorder) [sqlalchemy.engine.Engine] DELETE FROM states WHERE states.state_id IN ([many])
2024-07-20 04:12:00.402 INFO (Recorder) [sqlalchemy.engine.Engine] [generated in 0.00377s] (1139875840, 1139875841, 1139875842, 1139875843, 1139875844, 1139875845, 1139875846, 1139875847, 1139875848, 1139875849, 1139875850, 1139875851, 1139875852, 1139875853, 1139875854, 1139875855, 1139875856, 1139875857, 1139875858, 1139875859, 1139875860, 1139875861, 1139875862, 1139875863, 1139875864, 1139875865, 1139875866, 1139875867, 1139875868, 1139875869, 1139875870, 1139875871, 1139875872, 1139875873, 1139875874, 1139875875, 1139875876, 1139875877, 1139875878, 1139875879, 1139875880, 1139875881, 1139875882, 1139875883, 1139875884, 1139875885, 1139875886, 1139875887, 1139875888, 1139875889 ... 3900 parameters truncated ... 1139875790, 1139875791, 1139875792, 1139875793, 1139875794, 1139875795, 1139875796, 1139875797, 1139875798, 1139875799, 1139875800, 1139875801, 1139875802, 1139875803, 1139875804, 1139875805, 1139875806, 1139875807, 1139875808, 1139875809, 1139875810, 1139875811, 1139875812, 1139875813, 1139875814, 1139875815, 1139875816, 1139875817, 1139875818, 1139875819, 1139875820, 1139875821, 1139875822, 1139875823, 1139875824, 1139875825, 1139875826, 1139875827, 1139875828, 1139875829, 1139875830, 1139875831, 1139875832, 1139875833, 1139875834, 1139875835, 1139875836, 1139875837, 1139875838, 1139875839)
2024-07-20 04:12:00.433 INFO (Recorder) [sqlalchemy.engine.Engine] SELECT states.state_id, states.attributes_id 
FROM states 
WHERE states.last_updated_ts < ?

And then nothing from the recorder anymore.
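For context, the batched purge pattern visible in these logs (select up to 4000 expired state_ids, NULL out old_state_id references to them, delete the batch, repeat) can be sketched roughly as follows. This is a simplification on a toy table, not the recorder's actual code:

```python
import sqlite3

BATCH_SIZE = 4000  # matches the LIMIT seen in the logs above

def purge_states(conn: sqlite3.Connection, purge_before_ts: float) -> int:
    """Delete states rows older than purge_before_ts in batches."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT state_id FROM states WHERE last_updated_ts < ? LIMIT ?",
            (purge_before_ts, BATCH_SIZE),
        ).fetchall()
        if not rows:
            return total
        ids = [r[0] for r in rows]
        marks = ",".join("?" * len(ids))
        with conn:  # one transaction per batch
            # Detach surviving rows that still reference the doomed batch...
            conn.execute(
                f"UPDATE states SET old_state_id = NULL "
                f"WHERE old_state_id IN ({marks})",
                ids,
            )
            # ...then delete the batch itself.
            conn.execute(f"DELETE FROM states WHERE state_id IN ({marks})", ids)
        total += len(ids)

# Demo on a toy table: 10 expired rows, 5 current ones.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE states (state_id INTEGER PRIMARY KEY,"
    " last_updated_ts FLOAT, old_state_id INTEGER)"
)
conn.executemany(
    "INSERT INTO states (last_updated_ts, old_state_id) VALUES (?, ?)",
    [(100.0, None)] * 10 + [(900.0, None)] * 5,
)
deleted = purge_states(conn, purge_before_ts=500.0)
print(deleted)  # → 10
```

It is the DELETE step of this loop that triggers the full table scan when the problematic foreign key is present.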

@PatrikNorrgard
Copy link

I am also still having this issue with 2024.7.4, recorder hangs after 4 in the morning and a reboot gets it going again.
I do however have a longer keep value since some months back:
recorder:
  purge_keep_days: 30
  commit_interval: 0

It was still working fine until I updated to 2024.7

@alexwhittemore
Copy link

Same over here. I’m on 2024.7.4, but the recorder reliably dies at 4 am every day until a restart. What debug info can I provide to help the effort?

@Sinandgrin
Copy link

Still having this issue here as well, running 2024.8. I do have the recorder set to keep for 1 year, since I wanted the history, but as mentioned above (#117263 (comment)), this was working fine up until last month (07/19 for me, but I can't recall when or what version I may have updated to at that time).

My recorder settings in configuration.yaml
recorder:
  purge_keep_days: 365

@alexwhittemore
Copy link

I have the same configuration, with 365 days of retention, but after upgrading to 2024.8 yesterday it seems my recorder didn't die last night at 4am!

@mobot95

mobot95 commented Aug 9, 2024

Same issue with 2024.8

@tomas-dorda

I have the default configuration and still have the issue on 2024.8.

@hapklaar

hapklaar commented Aug 9, 2024

After upgrading to 2024.8 and re-enabling auto_purge, today I see my graphs have stopped updating at 4 am again. SSD is at 100% and crunching away. After restarting HA, graphs are back, but with no data from 4 am till the restart.

My recorder settings:

recorder:
  purge_keep_days: 90
  auto_purge: true
  exclude:
    domains: 
      - updater
      - media_player
      - device_tracker
      - weather
      - sun

@bdraco, tagging you as you provided the fix and closed the issue. Could it be that the issue was not fixed in certain cases, and can it be reopened?

EDIT: added some info

@disconn3ct

SSD is at 100% and crunching away

Run a manual purge with your retention time (1 year) and wait for it to stop crunching before restarting.
Mine has made it 3 days since doing that, still running 2024.07

@Sebazzz

Sebazzz commented Aug 9, 2024

I wish there was a way to do a manual migration. That would be fine for me.

@hapklaar

hapklaar commented Aug 9, 2024

SSD is at 100% and crunching away

Run a manual purge with your retention time (1 year) and wait for it to stop crunching before restarting. Mine has made it 3 days since doing that, still running 2024.07

I'm currently doing this (90 days retention). It has been running for 3 hours now, with 100% disk usage. I assume this is no different from the auto purge? (Which had been running for 8 hours without ending before I restarted.)

@bdraco
Member

bdraco commented Aug 9, 2024

I wish there was a way to do a manual migration. That would be fine for me.

It's been posted above.

#117263 (comment)

@bdraco
Member

bdraco commented Aug 9, 2024

Please be sure your system meets the disk space requirements before updating to 2024.8.1

https://www.home-assistant.io/integrations/recorder/#disk-space-requirements

As there are multiple issue threads going, and your comments will likely be missed otherwise, please continue at https://community.home-assistant.io/t/psa-2024-7-recorder-problems/746428

@home-assistant home-assistant locked as resolved and limited conversation to collaborators Aug 9, 2024