
[Bug]: Deletion in compressed chunks led to database size increase of nearly 20x #6196

Open
Oidlichtnwoada opened this issue Oct 14, 2023 · 8 comments

@Oidlichtnwoada

Oidlichtnwoada commented Oct 14, 2023

What type of bug is this?

Crash, Performance issue

What subsystems and features are affected?

Compression

What happened?

We have a hypertable "measurements" with 2300 6-hour chunks. It was 25 GB compressed (per hypertable_size; the whole database used 30 GB). We ran a DELETE statement on measurements to remove the measurements of deleted projects in that hypertable, and this took our whole cluster down: right after executing the delete, the database had grown to 330 GB in Longhorn (mostly the hypertable, as queried with hypertable_size) and was still growing, while the HDD space of all nodes was already full. This effectively killed our whole Kubernetes cluster, because each volume replica in Longhorn filled up the HDD of its underlying node (not even the volume size limit in Longhorn helped).

Could you please mention in the docs that during a delete the database size will increase rapidly, since it seems every affected chunk decompresses itself? I am not sure whether that is intended; it would be smarter to decompress, delete the data, and recompress chunks in batches instead of decompressing everything, which leads to massive HDD usage.

We run the Docker image timescale/timescaledb-ha:pg14.9-ts2.11.2-all with a Longhorn volume.

Furthermore, once chunks are decompressed, Longhorn cannot reclaim the unused HDD space, even if VACUUM is executed. Is a VACUUM FULL call with downtime necessary? After the decompression the size stays large in Longhorn, and even after a filesystem trim nothing can be reclaimed. There must be some error in the PostgreSQL/TimescaleDB implementation.
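For context on the VACUUM question: a plain VACUUM only marks dead rows as reusable inside PostgreSQL's own data files and does not shrink them, so a block-storage layer like Longhorn sees no freed space. Returning space to the filesystem generally requires a rewrite such as VACUUM FULL, which holds an ACCESS EXCLUSIVE lock for the duration. A rough sketch of the relevant commands (the chunk name below is illustrative):

    -- Where did the space go? (TimescaleDB size helpers)
    SELECT pg_size_pretty(hypertable_size('measurements')) AS hypertable_size;

    SELECT chunk_schema, chunk_name, pg_size_pretty(total_bytes) AS size
      FROM chunks_detailed_size('measurements')
     ORDER BY total_bytes DESC
     LIMIT 20;

    -- A plain VACUUM only frees space for reuse inside PostgreSQL.
    VACUUM _timescaledb_internal._hyper_1_123_chunk;

    -- VACUUM FULL rewrites the relation and returns space to the filesystem,
    -- but takes an ACCESS EXCLUSIVE lock (effectively downtime for that chunk).
    VACUUM FULL _timescaledb_internal._hyper_1_123_chunk;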

Thanks in advance for the response ...

The delete statement was something like:
DELETE FROM measurements m WHERE variable NOT IN (SELECT id FROM other_table)

TimescaleDB version affected

2.11.2

PostgreSQL version used

14.9

What operating system did you use?

docker

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

Delete data from a hypertable with compressed chunks 

The machine had 32 GB of RAM and an 8-core CPU.

This is our config:

postgresql.conf: |
    listen_addresses = '*'
    max_connections = 192
    shared_buffers = 1989MB
    work_mem = 5093kB
    maintenance_work_mem = 1018619kB
    dynamic_shared_memory_type = posix
    effective_io_concurrency = 256
    max_worker_processes = 32
    max_parallel_workers_per_gather = 8
    max_parallel_workers = 16
    wal_buffers = 16MB
    checkpoint_completion_target = 0.9
    max_wal_size = 1GB
    min_wal_size = 512MB
    random_page_cost = 1.1
    effective_cache_size = 5968MB
    default_statistics_target = 500
    idle_in_transaction_session_timeout = 3600000
    log_timezone = 'Etc/UTC'
    autovacuum_max_workers = 10
    autovacuum_naptime = 10
    datestyle = 'iso, mdy'
    timezone = 'Etc/UTC'
    lc_messages = 'C.UTF-8'
    lc_monetary = 'C.UTF-8'
    lc_numeric = 'C.UTF-8'
    lc_time = 'C.UTF-8'
    default_text_search_config = 'pg_catalog.english'
    shared_preload_libraries = 'timescaledb'
    max_locks_per_transaction = 2048
    timescaledb.telemetry_level=basic
    timescaledb.max_background_workers = 32
    timescaledb.last_tuned = '2023-06-07T09:05:44Z'
    timescaledb.last_tuned_version = '0.14.3'
@konskov
Contributor

konskov commented Oct 16, 2023

Hi @Oidlichtnwoada , thanks for reaching out. I understand this issue is not of your making so the extra time you spent reporting it is really appreciated.
The way updates/deletes of compressed data work is by decompressing the required rows and then performing the operation on uncompressed data. The amount of data that will be uncompressed depends on the subquery / WHERE clause, based on which we can filter the compressed data.
For example, if segmentby columns are used in the WHERE clause, then we are able to decompress only the rows with the specified segmentby values. Or, if orderby values are specified, we may be able to target specific chunks within the orderby value range. But there are still cases where we are unable to do such filtering, so we can end up decompressing much more data than the update/delete operation will touch, leaving the rest uncompressed. I think that might be the case here: we are not able to do effective filtering based on the subquery.
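To illustrate the filtering point with a hedged example (the project_id column and compression settings below are hypothetical, not taken from the reporter's schema): a predicate on a segmentby column lets only the matching segments be decompressed, while a predicate like the NOT IN subquery in this report may not allow any such filtering.

    -- Hypothetical compression settings: segment by project_id.
    ALTER TABLE measurements SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'project_id',
        timescaledb.compress_orderby   = 'time DESC'
    );

    -- Filters on the segmentby column: only matching segments need decompression.
    DELETE FROM measurements WHERE project_id = 42;

    -- Filters on a non-segmentby column via a subquery: little or no filtering
    -- is possible, so far more data may be decompressed than is deleted.
    DELETE FROM measurements WHERE variable NOT IN (SELECT id FROM other_table);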

I think that is probably why VACUUM appears to be ineffective: if the majority of the chunk remains uncompressed and only a minor amount is deleted, this has no noticeable effect on the chunk size, which is dominated by the uncompressed data.

Could you please share the explain plan for the delete statement you are running? That would help us understand the situation better.
Could you also share the VACUUM command you ran?
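In case it helps, one way to capture that plan without permanently running the delete (standard PostgreSQL, nothing TimescaleDB-specific) would be:

    -- Plain EXPLAIN shows the plan without executing the statement.
    EXPLAIN (VERBOSE, COSTS)
    DELETE FROM measurements m WHERE variable NOT IN (SELECT id FROM other_table);

    -- EXPLAIN ANALYZE actually executes the delete, so wrap it in a transaction
    -- and roll back; note that decompression triggered along the way can still
    -- consume disk space until vacuumed, so run this with care.
    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    DELETE FROM measurements m WHERE variable NOT IN (SELECT id FROM other_table);
    ROLLBACK;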

As a potential workaround to help in this situation, we would suggest updating and then recompressing each chunk one by one. This would keep the size reasonable, as at most one chunk would be uncompressed at a time. Hope that helps.
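A rough sketch of that chunk-by-chunk workaround, reusing the hypertable and predicate from this report (treat it as a starting point rather than a verified fix; the per-chunk COMMIT requires PostgreSQL 11+ and running the block outside an explicit transaction):

    DO $$
    DECLARE
        ch regclass;
    BEGIN
        FOR ch IN SELECT show_chunks('measurements')
        LOOP
            -- Decompress only this chunk (no-op if it is already uncompressed).
            PERFORM decompress_chunk(ch, if_compressed => true);

            -- Delete directly from the chunk so only its rows are touched.
            EXECUTE format(
                'DELETE FROM %s WHERE variable NOT IN (SELECT id FROM other_table)',
                ch);

            -- Recompress before moving on, so at most one chunk is uncompressed.
            PERFORM compress_chunk(ch, if_not_compressed => true);

            COMMIT;  -- release locks and make the freed space reusable per chunk
        END LOOP;
    END
    $$;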

@djzurawski

The way updates/deletes of compressed data work is by decompressing the required rows and then performing the operation on uncompressed data. The amount of data that will be uncompressed depends on the subquery / WHERE clause, based on which we can filter the compressed data.

This is what I've been told, but I do not find it to be the case. I've been having disk space issues trying to use the new feature of deleting directly from compressed chunks too. If I do it the old manual way of decompressing the affected chunks and then deleting the rows, everything works as expected. With the automatic process, resource usage goes crazy: CPU, RAM, and disk. Something isn't right.

@wrobell

wrobell commented Oct 23, 2023

Certain execution plans generated by TimescaleDB might be tricky. I am struggling with something similar (I hope to raise a separate ticket soon).

@djzurawski Please try to see the plan for

DELETE FROM measurements m WHERE variable not in (SELECT id FROM other_table)

vs

DELETE FROM measurements m WHERE variable not in (some, constant, values)

Is there a big difference? (A shot in the dark, since the table schemas are unspecified.)
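For reference, plain EXPLAIN (which does not execute the statements) should be enough for that comparison; the constant list below is purely illustrative:

    EXPLAIN DELETE FROM measurements m
     WHERE variable NOT IN (SELECT id FROM other_table);

    EXPLAIN DELETE FROM measurements m
     WHERE variable NOT IN (1, 2, 3);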

@Oidlichtnwoada
Author

Our resource usage also went completely out of control. I would not have released the deletion feature for compressed chunks, as there are only two user groups:

  • users that are using the feature and are frustrated
  • users that are not using the feature

@wrobell

wrobell commented Nov 12, 2023

#6309 might be related.

@sergiunagy

Same behavior observed.
Using the official timescale/timescaledb-ha container, TimescaleDB extension v2.14.2 on PostgreSQL 16.2 (Ubuntu 16.2-1.pgdg22.04+1).

Running a DELETE operation shows big jumps in RAM, disk, and CPU usage, even when targeting a single sample.

@fcovatti

Same behavior observed. Using the official timescale/timescaledb-ha container, TimescaleDB extension v2.14.2 on PostgreSQL 16.2 (Ubuntu 16.2-1.pgdg22.04+1).

Running a DELETE operation shows big jumps in RAM, disk, and CPU usage, even when targeting a single sample.

I understand the increase in disk usage, since the chunk needs to be decompressed, but even after the deletion completes we see bad performance when querying data over the chunk from which data was deleted.
We also noticed chunks marked as "compressed" that still contain uncompressed data.

This issue also seems similar to #5802.

Is there any update regarding fixing deletion from compressed chunks, or does anyone have a recommendation/workaround that avoids running into this performance degradation?
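One way to at least locate such chunks and recompress their loose rows, sketched under the assumption that compress_chunk on your version also handles partially compressed chunks (please verify before relying on it):

    -- Compression status and sizes per chunk.
    SELECT chunk_schema, chunk_name, compression_status,
           pg_size_pretty(before_compression_total_bytes) AS before,
           pg_size_pretty(after_compression_total_bytes)  AS after
      FROM chunk_compression_stats('measurements')
     ORDER BY chunk_name;

    -- Recompress chunks that are not fully compressed (the hypertable name is
    -- taken from the original report).
    SELECT compress_chunk(c, if_not_compressed => true)
      FROM show_chunks('measurements') AS c;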

@antekresic
Contributor

There have been recent optimizations to UPDATE/DELETE operations released with version 2.17.

Those should improve the resource usage when running such operations.

Have you had the chance to check whether the situation has improved since then?
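For anyone checking whether they are already on a version that includes those changes, these standard commands should do (run per database that has the extension installed):

    -- Currently installed extension version.
    SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';

    -- Upgrade to the default version shipped with the installed binaries;
    -- ideally run this as the first command in a fresh session.
    ALTER EXTENSION timescaledb UPDATE;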

@antekresic antekresic self-assigned this Dec 9, 2024